Systems and methods for determining nucleic acids

ABSTRACT

The present invention generally relates to systems and methods for imaging or determining nucleic acids, for instance, within cells. In some embodiments, the transcriptome of a cell may be determined. Certain embodiments are directed to determining nucleic acids, such as mRNA, within cells at relatively high resolutions. In some embodiments, a plurality of nucleic acid probes may be applied to a sample, and their binding within the sample determined, e.g., using fluorescence, to determine locations of the nucleic acid probes within the sample. In some embodiments, codewords may be based on the binding of the plurality of nucleic acid probes, and in some cases, the codewords may define an error-correcting code to reduce or prevent misidentification of the nucleic acids. In certain cases, a relatively large number of different targets may be identified using a relatively small number of labels, e.g., by using various combinatorial approaches.

RELATED APPLICATIONS

This application is a continuation of U.S. application Ser. No. 17/374,000, filed Jul. 13, 2021, entitled “Systems and Methods for Determining Nucleic Acids,” which is a divisional of U.S. application Ser. No. 15/329,683, filed Jan. 27, 2017, entitled “Systems and Methods for Determining Nucleic Acids,” which is a national stage filing of International Patent Application Serial No. PCT/US2015/042556, which claims the benefit of U.S. Provisional Patent Application Ser. No. 62/031,062, filed Jul. 30, 2014, entitled “Systems and Methods for Determining Nucleic Acids,” by Zhuang, et al.; U.S. Provisional Patent Application Ser. No. 62/142,653, filed Apr. 3, 2015, entitled “Systems and Methods for Determining Nucleic Acids,” by Zhuang, et al.; and U.S. Provisional Patent Application Ser. No. 62/050,636, filed Sep. 15, 2014, entitled “Probe Library Construction,” by Zhuang, et al. Each of the above is incorporated herein by reference.

GOVERNMENT FUNDING

This invention was made with government support under Grant No. GM096450 awarded by National Institutes of Health. The government has certain rights in the invention.

REFERENCE TO AN ELECTRONIC SEQUENCE LISTING

The contents of the electronic sequence listing (H049870509US04-SUBSEQ-TC.xml; Size: 52,293 bytes; and Date of Creation: May 17, 2023) is herein incorporated by reference in its entirety.

FIELD

The present invention generally relates to systems and methods for imaging or determining nucleic acids, for instance, within cells. In some embodiments, the transcriptome of a cell may be determined.

BACKGROUND

Single-molecule fluorescent in situ hybridization (smFISH) is a powerful method for detecting individual mRNA molecules in cells. The high detection efficiency and large dynamic range of this method provide exquisite detail into the expression state, spatial distribution within cells and intact tissues, and variation among cells of individual mRNAs. Such approaches have been essential to many recent insights into understanding gene regulation and expression. A fundamental limitation of smFISH, however, is its low throughput, typically only a few genes at a time. This low throughput is due to a lack of distinguishable probes with which to label cells and the cost of producing large amounts of labeled probe required for high efficiency staining. Thus, improvements in detecting mRNA molecules are needed.

SUMMARY

The present invention generally relates to systems and methods for imaging or determining nucleic acids, for instance, within cells. In some embodiments, the transcriptome of a cell may be determined. The subject matter of the present invention involves, in some cases, interrelated products, alternative solutions to a particular problem, and/or a plurality of different uses of one or more systems and/or articles.

In one aspect, the present invention is generally directed to a composition. According to one set of embodiments, the composition comprises a plurality of nucleic acid probes, at least some of which comprise a first portion comprising a target sequence and a plurality of read sequences. In some cases, each comprises a first portion comprising a target sequence and a plurality of read sequences. In some embodiments, the plurality of read sequences are distributed on the plurality of nucleic acid probes so as to define an error-correcting code.

In another aspect, the present invention is generally directed to a method. In one set of embodiments, the method includes acts of exposing a sample to a plurality of nucleic acid probes; for each of the nucleic acid probes, determining binding of the nucleic acid probes within the sample; creating codewords based on the binding of the nucleic acid probes; and for at least some of the codewords, matching the codeword to a valid codeword wherein, if no match is found, applying error correction to the codeword to form a valid codeword.

The method, in another set of embodiments, includes acts of exposing a sample to a plurality of nucleic acid probes, wherein the nucleic acid probes comprise a first portion comprising a target sequence and a second portion comprising one or more read sequences, and wherein at least some of the plurality of nucleic acid probes comprises distinguishable nucleic acid probes formed from combinatorial combination of one or more read sequences taken from a plurality of read sequences; and for each of the nucleic acid probes, determining binding of the target sequences of the nucleic acid probes within the sample.

In yet another set of embodiments, the method includes acts of exposing a sample to a plurality of primary nucleic acid probes (also called encoding probes); exposing the plurality of primary nucleic acid probes to a sequence of secondary nucleic acid probes (also called readout probes) and determining fluorescence of each of the secondary nucleic acid probes within the sample; creating codewords based on fluorescence of the secondary nucleic acid probes; and for at least some of the codewords, matching the codeword to a valid codeword wherein, if no match is found, applying error correction to the codeword to form a valid codeword.

In one set of embodiments, the method includes acts of exposing a plurality of primary nucleic acid probes to a sample; and exposing the plurality of nucleic acid probes to a sequence of secondary nucleic acid probes and determining fluorescence of each of the secondary probes within the sample. In some embodiments, at least some of the plurality of secondary nucleic acid probes comprises distinguishable secondary nucleic acid probes formed from combinatorial combination of one or more read sequences (or readout probe sequences) taken from a plurality of read sequences (or readout probe sequences).

In another set of embodiments, the method comprises acts of exposing a cell to a plurality of nucleic acid probes, exposing the plurality of nucleic acid probes to a first secondary probe comprising a first signaling entity, determining the first signaling entity at a precision better than 500 nm, inactivating the first signaling entity, exposing the plurality of nucleic acid probes to a second secondary probe comprising a second signaling entity, and determining the second signaling entity at a precision better than 500 nm.

In another set of embodiments, the method comprises acts of exposing a cell to a plurality of nucleic acid probes, exposing the plurality of nucleic acid probes to a first secondary probe comprising a first signaling entity, determining the first signaling entity at a resolution better than 100 nm, inactivating the first signaling entity, exposing the plurality of nucleic acid probes to a second secondary probe comprising a second signaling entity, and determining the second signaling entity at a resolution better than 100 nm.

The method, in yet another set of embodiments, includes acts of exposing a cell to a plurality of nucleic acid probes, exposing the plurality of nucleic acid probes to a first secondary probe comprising a first signaling entity, determining the first signaling entity using a super-resolution imaging technique, inactivating the first signaling entity, exposing the plurality of nucleic acid probes to a second secondary probe comprising a second signaling entity, and determining the second signaling entity using a super-resolution imaging technique.

In certain embodiments, the method comprises acts of associating a plurality of targets with a plurality of target sequences and a plurality of codewords, wherein the codewords comprise a number of positions and values for each position, and the codewords form an error-checking and/or error-correcting code space; associating a plurality of distinguishable read sequences with the plurality of codewords such that each distinguishable read sequence represents a value of a position within the codewords; and forming a plurality of nucleic acid probes, each comprising a target sequence and one or more read sequences.

In addition, in one set of embodiments, the method includes acts of associating a plurality of targets with a plurality of target sequences and a plurality of codewords, wherein the codewords comprise a number of positions and values for each position, and the codewords form an error-checking and/or error-correcting code space; forming a plurality of nucleic acid probes each comprising a target sequence; and forming groups comprising the plurality of nucleic acid probes such that each group of nucleic acid probes corresponds to at least one common value of a position within the codewords.

In another set of embodiments, the method includes acts of associating a plurality of targets with a plurality of target sequences and a plurality of codewords, wherein the codewords comprise a number of positions that is less than the number of targets, and wherein each codeword is associated with a single target, associating a plurality of distinguishable read sequences with the plurality of codewords such that each distinguishable read sequence represents a value of a position within the codewords, and forming a plurality of nucleic acid probes, each comprising a target sequence and one or more read sequences.

The method, in still another set of embodiments, includes acts of exposing a plurality of nucleic acid probes to a cell, exposing the plurality of nucleic acid probes to a sequence of secondary probes and determining fluorescence of each of the secondary probes within the cell, and based on the sequence of fluorescence of each of the secondary probes, determining nucleic acids within the cell.

In another set of embodiments, the method includes acts of associating a plurality of targets with a plurality of target sequences and a plurality of codewords, wherein the codewords comprise a number of positions and values for each position, and the codewords form an error-checking and/or error-correcting code space; forming a plurality of nucleic acid probes each comprising a target sequence; and forming groups comprising the plurality of nucleic acid probes such that each group of nucleic acid probes corresponds to at least one common value of a position within the codewords.

In yet another set of embodiments, the method includes acts of exposing a cell to a plurality of nucleic acid probes, exposing the plurality of nucleic acid probes to a first secondary probe comprising a first signaling entity, determining the first signaling entity using a super-resolution imaging technique, inactivating the first signaling entity, exposing the plurality of nucleic acid probes to a second secondary probe comprising a second signaling entity, and determining the second signaling entity using a super-resolution imaging technique.

In still another set of embodiments, the method includes acts of exposing a cell to a plurality of nucleic acid probes, exposing the plurality of nucleic acid probes to a first secondary probe comprising a first signaling entity, determining the first signaling entity at a precision better than 500 nm, inactivating the first signaling entity, exposing the plurality of nucleic acid probes to a second secondary probe comprising a second signaling entity, and determining the second signaling entity at a precision better than 500 nm.

In still another set of embodiments, the method includes acts of exposing a cell to a plurality of nucleic acid probes, exposing the plurality of nucleic acid probes to a first secondary probe comprising a first signaling entity, determining the first signaling entity at a resolution better than 100 nm, inactivating the first signaling entity, exposing the plurality of nucleic acid probes to a second secondary probe comprising a second signaling entity, and determining the second signaling entity using a super-resolution imaging technique.

In still another set of embodiments, the method includes acts of associating a plurality of nucleic acid targets with a plurality of target sequences and a plurality of codewords, wherein the codewords comprise a number of positions and values for each position, and the codewords form an error-checking and/or error-correcting code; associating unique read sequences with each possible value of each position in the codewords, wherein the read sequences are taken from a set of orthogonal sequences, which have limited homology with one another and with the nucleic acid species in a sample; forming a plurality of primary nucleic acid probes each comprising a target sequence that uniquely binds to a nucleic acid target and one or more read sequences; forming a plurality of secondary nucleic acid probes comprising a signaling entity and a sequence that is complementary to one of the read sequences; exposing a sample to the primary nucleic acid probes such that the nucleic acid probes hybridize to the nucleic acid targets in the sample; exposing the primary nucleic acid probes in the sample to a secondary nucleic acid probe such that the secondary nucleic acid probe hybridizes to the read sequence on at least some of the primary nucleic acid probes; imaging the sample; and repeating the exposing and imaging steps one or more times, using a different secondary nucleic acid probe for at least some of the repetitions.

The method, according to yet another set of embodiments, includes acts of associating a plurality of nucleic acid targets with a plurality of target sequences and a plurality of codewords wherein the codewords comprise a number of positions and values for each position, and the codewords form an error-checking and/or error-correcting code space; forming a plurality of nucleic acid probes comprising a signaling entity and a target sequence that uniquely binds to one of the nucleic acid targets; grouping the nucleic acid probes into a plurality of probe pools, wherein each of the probe pools corresponds to a specific value of a unique position within the codewords; exposing a sample to one of the probe pools; imaging the sample; and repeating the exposing and imaging steps one or more times, using a different probe pool for at least some of the repetitions.

In another aspect, the present invention encompasses methods of making one or more of the embodiments described herein. In still another aspect, the present invention encompasses methods of using one or more of the embodiments described herein.

Other advantages and novel features of the present invention will become apparent from the following detailed description of various non-limiting embodiments of the invention when considered in conjunction with the accompanying figures. In cases where the present specification and a document incorporated by reference include conflicting and/or inconsistent disclosure, the present specification shall control. If two or more documents incorporated by reference include conflicting and/or inconsistent disclosure with respect to each other, then the document having the later effective date shall control.

BRIEF DESCRIPTION OF THE DRAWINGS

Non-limiting embodiments of the present invention will be described by way of example with reference to the accompanying figures, which are schematic and are not intended to be drawn to scale. In the figures, each identical or nearly identical component illustrated is typically represented by a single numeral. For purposes of clarity, not every component is labeled in every figure, nor is every component of each embodiment of the invention shown where illustration is not necessary to allow those of ordinary skill in the art to understand the invention. In the figures:

FIGS. 1A-1C illustrate an encoding scheme for nucleic acid probes, in certain embodiments of the invention;

FIGS. 2A-2G illustrate determination of mRNAs in a cell, in some embodiments of the invention;

FIGS. 3A-3B illustrate the determination of nucleic acids, in accordance with various embodiments of the invention;

FIGS. 4A-4B show a non-limiting example of multiple read sequences distributed in a population of different nucleic acid probes, in accordance with certain embodiments of the invention;

FIGS. 5A-5E illustrate the determination of nucleic acids, in accordance with another embodiment of the invention;

FIGS. 6A-6H illustrate simultaneous determination of multiple nucleic acid species in cells, in certain embodiments of the invention;

FIGS. 7A-7F show expression noise of genes and co-variation of expression between different genes determined in accordance with some embodiments of the invention;

FIGS. 8A-8E illustrate spatial distribution of RNAs in cells determined in accordance with one embodiment of the invention;

FIGS. 9A-9C illustrate simultaneous determination of multiple nucleic acid species in cells, in another embodiment of the invention;

FIGS. 10A-10C show expression between different genes determined in accordance with yet another embodiment of the invention;

FIG. 11 is a schematic description of combinatorial labeling, in accordance with another embodiment of the invention;

FIGS. 12A-12C show schematic descriptions of Hamming distance, in another embodiment of the invention;

FIG. 13 illustrates the production of a library of probes, in still another embodiment of the invention;

FIGS. 14A-14B illustrate fluorescent spot determinations, in another embodiment of the invention;

FIGS. 15A-15B illustrate that error correction facilitates RNA detection, in yet another embodiment of the invention;

FIGS. 16A-16E show characterization of misidentification rates and calling rates, in one embodiment of the invention;

FIGS. 17A-17D show characterization of misidentification rates and calling rates, in another embodiment of the invention;

FIGS. 18A-18C show a comparison of experiments, in accordance with one embodiment of the invention;

FIGS. 19A-19D illustrate decoding and error assessment, in another embodiment of the invention; and

FIGS. 20A-20H show the codebooks for certain experiments in another embodiment.

BRIEF DESCRIPTION OF THE SEQUENCES

SEQ ID NO: 1 is GTTGGCGACGAAAGCACTGCGATTGGAACCGTCCCAAGCGTTGCGCTTAATGGATCATCAATTTTGTCTCACTACGACGGTCAATCGCGCTGCATACTTGCGTCGGTCGGACAAACGAGG;
SEQ ID NO: 2 is CGCAACGCTTGGGACGGTTCCAATCGGATC;
SEQ ID NO: 3 is CGAATGCTCTGGCCTCGAACGAACGATAGC;
SEQ ID NO: 4 is ACAAATCCGACCAGATCGGACGATCATGGG;
SEQ ID NO: 5 is CAAGTATGCAGCGCGATTGACCGTCTCGTT;
SEQ ID NO: 6 is TGCGTCGTCTGGCTAGCACGGCACGCAAAT;
SEQ ID NO: 7 is AAGTCGTACGCCGATGCGCAGCAATTCACT;
SEQ ID NO: 8 is CGAAACATCGGCCACGGTCCCGTTGAACTT;
SEQ ID NO: 9 is ACGAATCCACCGTCCAGCGCGTCAAACAGA;
SEQ ID NO: 10 is CGCGAAATCCCCGTAACGAGCGTCCCTTGC;
SEQ ID NO: 11 is GCATGAGTTGCCTGGCGTTGCGACGACTAA;
SEQ ID NO: 12 is CCGTCGTCTCCGGTCCACCGTTGCGCTTAC;
SEQ ID NO: 13 is GGCCAATGGCCCAGGTCCGTCACGCAATTT;
SEQ ID NO: 14 is TTGATCGAATCGGAGCGTAGCGGAATCTGC;
SEQ ID NO: 15 is CGCGCGGATCCGCTTGTCGGGAACGGATAC;
SEQ ID NO: 16 is GCCTCGATTACGACGGATGTAATTCGGCCG;
SEQ ID NO: 17 is GCCCGTATTCCCGCTTGCGAGTAGGGCAAT;
SEQ ID NO: 18 is GTTGGTCGGCACTTGGGTGC;
SEQ ID NO: 19 is CGATGCGCCAATTCCGGTTC;
SEQ ID NO: 20 is CGCGGGCTATATGCGAACCG;
SEQ ID NO: 21 is TAATACGACTCACTATAGGGAAAGCCGGTTCATCCGGTGG;
SEQ ID NO: 22 is TAATACGACTCACTATAGGGTGATCATCGCTCGCGGGTTG;
SEQ ID NO: 23 is TAATACGACTCACTATAGGGCGTGGAGGGCATACAACGC;
SEQ ID NO: 24 is CGCAACGCTTGGGACGGTTCCAATCGGATC/3Cy5Sp/;
SEQ ID NO: 25 is CGAATGCTCTGGCCTCGAACGAACGATAGC/3Cy5Sp/;
SEQ ID NO: 26 is ACAAATCCGACCAGATCGGACGATCATGGG/3Cy5Sp/;
SEQ ID NO: 27 is CAAGTATGCAGCGCGATTGACCGTCTCGTT/3Cy5Sp/;
SEQ ID NO: 28 is GCGGGAAGCACGTGGATTAGGGCATCGACC/3Cy5Sp/;
SEQ ID NO: 29 is AAGTCGTACGCCGATGCGCAGCAATTCACT/3Cy5Sp/;
SEQ ID NO: 30 is CGAAACATCGGCCACGGTCCCGTTGAACTT/3Cy5Sp/;
SEQ ID NO: 31 is ACGAATCCACCGTCCAGCGCGTCAAACAGA/3Cy5Sp/;
SEQ ID NO: 32 is CGCGAAATCCCCGTAACGAGCGTCCCTTGC/3Cy5Sp/;
SEQ ID NO: 33 is GCATGAGTTGCCTGGCGTTGCGACGACTAA/3Cy5Sp/;
SEQ ID NO: 34 is CCGTCGTCTCCGGTCCACCGTTGCGCTTAC/3Cy5Sp/;
SEQ ID NO: 35 is GGCCAATGGCCCAGGTCCGTCACGCAATTT/3Cy5Sp/;
SEQ ID NO: 36 is TTGATCGAATCGGAGCGTAGCGGAATCTGC/3Cy5Sp/;
SEQ ID NO: 37 is CGCGCGGATCCGCTTGTCGGGAACGGATAC/3Cy5Sp/;
SEQ ID NO: 38 is GCCTCGATTACGACGGATGTAATTCGGCCG/3Cy5Sp/; and
SEQ ID NO: 39 is GCCCGTATTCCCGCTTGCGAGTAGGGCAAT/3Cy5Sp/.

DETAILED DESCRIPTION

The present invention generally relates to systems and methods for imaging or determining nucleic acids, for instance, within cells. In some embodiments, the transcriptome of a cell may be determined. Certain embodiments are directed to determining nucleic acids, such as mRNA, within cells at relatively high resolutions. In some embodiments, a plurality of nucleic acid probes may be applied to a sample, and their binding within the sample determined, e.g., using fluorescence, to determine locations of the nucleic acid probes within the sample. In some embodiments, codewords may be based on the binding of the plurality of nucleic acid probes, and in some cases, the codewords may define an error-correcting code to reduce or prevent misidentification of the nucleic acids. In certain cases, a relatively large number of different targets may be identified using a relatively small number of labels, e.g., by using various combinatorial approaches.

Two example approaches are now discussed. It should be understood, however, that these are presented by way of explanation and not limitation; other aspects and embodiments are discussed in further detail herein. In one example method, primary probes (also called encoding probes) and secondary probes (also called readout probes) are used, where the primary probes encode “codewords” and bind to target nucleic acids in the sample, and the secondary probes are used to read out the codewords from the primary probes. In another example method, a plurality of different primary probes containing codewords are divided into as many separate pools as there are positions in the codewords, such that each primary probe pool corresponds to a certain value in a certain position of the codewords (e.g., a “one” in the first position as in “1001”).
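By way of a non-limiting illustration, the second example approach (one probe pool per codeword position) might be sketched as follows. The target names, the four-position codewords, and the function name pools_by_position are hypothetical and are chosen only for illustration; they are not taken from any codebook described herein.

```python
# Hypothetical codebook: target name -> binary codeword.
# The names and codewords are illustrative assumptions only.
codebook = {
    "target_A": "1001",
    "target_B": "0110",
    "target_C": "1010",
}


def pools_by_position(codebook, value="1"):
    """Return one pool per codeword position.

    Pool i lists every target whose codeword carries `value` at position i,
    so the number of pools equals the number of positions in the codewords.
    """
    n_positions = len(next(iter(codebook.values())))
    pools = [[] for _ in range(n_positions)]
    for target, word in codebook.items():
        for i, symbol in enumerate(word):
            if symbol == value:
                pools[i].append(target)
    return pools


pools = pools_by_position(codebook)
for i, pool in enumerate(pools, start=1):
    print(f"pool for position {i}: {pool}")
```

In this sketch, a target such as target_A, whose codeword has a “one” in the first and fourth positions, appears in the first and fourth pools only.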

The first example is now described with respect to FIG. 3A. As will be discussed in more detail below, in other embodiments, other configurations may be used as well. In this first example, a series of nucleic acid probes are used to determine nucleic acids within a cell or other sample, e.g., qualitatively or quantitatively. For example, nucleic acids may be identified as being present or absent, and/or the numbers or concentrations of certain nucleic acids may be determined within the cell or other sample. In some cases, the positions of the probes within the cell or other sample may be determined at relatively high resolutions, and in some cases, at resolutions better than the wavelength of visible light.

This example is generally directed to spatially detecting nucleic acids within a cell or other sample, e.g., at relatively high resolutions. For example, the nucleic acids may be mRNAs, or other nucleic acids described herein. In one set of embodiments, the nucleic acids within the cell may be determined by delivering or applying nucleic acid probes to the cell. In some cases, by using combinatorial approaches, a relatively large number of nucleic acids may be determined using a relatively small number of different labels on the nucleic acid probes. Thus, for example, a relatively small number of experiments may be used to determine a relatively large number of nucleic acids in a sample, e.g., due to simultaneous binding of the nucleic acid probes to different nucleic acids in the sample.

In one set of embodiments, a population of primary nucleic acid probes that are able to bind nucleic acids suspected of being present within the cell is applied to the cell (or other sample). Afterwards, secondary nucleic acid probes that can bind to or otherwise interact with some of the primary nucleic acid probes are added sequentially and determined, e.g., using imaging techniques such as fluorescence microscopy (e.g., conventional fluorescence microscopy), STORM (stochastic optical reconstruction microscopy) or other imaging techniques. After imaging, the secondary nucleic acid probes are inactivated or removed, and a different secondary nucleic acid probe is added to the sample. This may be repeated multiple times with multiple different secondary nucleic acid probes. The pattern of binding of the various secondary nucleic acid probes may be used to determine the primary nucleic acid probes at locations within the cell or other sample, which can be used to determine mRNA or other nucleic acids that are present.

For instance, as is shown in FIG. 3A, a population of nucleic acids 10 within a cell (represented here by nucleic acids 11, 12, and 13) may be exposed to a population of primary nucleic acid probes 20, including probes 21 and 22. The primary nucleic acid probes may contain, for instance, a target sequence that can recognize a nucleic acid (e.g., a sequence within nucleic acid 11). Probes 21 and 22 may contain the same or different targeting sequences, which may bind to or hybridize with the same or different nucleic acids. As an example, as is shown in FIG. 3A, probe 21 contains a first targeting sequence 25 which targets the probe to nucleic acid 11, while probe 22 contains a second targeting sequence 26, not identical to the first targeting sequence 25 and which targets the probe to nucleic acid 12. The target sequence may be substantially complementary to at least a portion of a target nucleic acid, and enough of the target sequence may be present such that specific binding of the nucleic acid probe to the target nucleic acid can occur.

Primary nucleic acid probes 20 may also contain one or more “read” sequences. Two such read sequences are used in this example, although in other embodiments, there may be one, three, four, or more read sequences present within a primary nucleic acid probe. The read sequences may all independently be the same or different. In addition, in one set of embodiments, different nucleic acid probes may use one or more common read sequences. For example, more than one read sequence may be combinatorially combined on different nucleic acid probes, thereby producing a relatively large number of different nucleic acid probes that can be separately identified, even though only a relatively small number of read sequences are used. Thus, for example, in FIG. 3A, probe 21 contains read sequences 27 and 29, while probe 22 contains read sequences 27 and 28, where the two read sequences 27 are identical, and different from read sequences 28 and 29.
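As a rough, non-limiting sketch of why combinatorial combination yields many distinguishable probes from few read sequences, the following enumerates the ways of placing two read sequences on a probe when three read sequences (labeled here after read sequences 27, 28, and 29 of FIG. 3A) are available. The string labels, the function name, and the larger figures of 16 read sequences taken 4 at a time are arbitrary assumptions used only to show the scaling.

```python
from itertools import combinations
from math import comb

# Labels loosely follow read sequences 27, 28 and 29 of FIG. 3A; the string
# names themselves are illustrative assumptions.
read_sequences = ["read_27", "read_28", "read_29"]

# Every way of placing two distinct read sequences on one primary probe.
pairs = list(combinations(read_sequences, 2))
print(pairs)


def n_distinguishable_probes(n_read_sequences, reads_per_probe):
    """Count probes that differ in which read sequences they carry."""
    return comb(n_read_sequences, reads_per_probe)


# Arbitrary larger example: 16 read sequences, 4 per probe.
print(n_distinguishable_probes(16, 4))
```

In this sketch, probe 21 of FIG. 3A would correspond to the pair (read_27, read_29) and probe 22 to the pair (read_27, read_28).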

After primary nucleic acid probes 20 have been introduced to the sample and allowed to interact with nucleic acids 11, 12, and 13, one or more secondary nucleic acid probes 30 may be applied to the sample to determine the primary nucleic acid probes. The secondary nucleic acid probes may contain a recognition sequence able to recognize one of the read sequences present within the population of primary nucleic acid probes. For instance, the recognition sequence may be substantially complementary to at least a portion of the read sequence, such that the secondary nucleic acid probe is able to bind to or hybridize with a corresponding primary nucleic acid probe. For instance, in this example, recognition sequence 35 is able to recognize read sequence 27. In addition, the secondary nucleic acid probes may contain one or more signaling entities 33. For example, a signaling entity may be a fluorescent entity attached to the probe, or a certain sequence of nucleic acids that can be determined in some fashion. More than one secondary probe may be used, e.g., sequentially. For example, as shown in this figure, the initial secondary probe 30 may be removed (e.g., as discussed below) and a new secondary probe 31 may be added, containing recognition sequence 36 able to recognize read sequence 28 and one or more signaling entities 33. This may also be repeated multiple times, e.g., to determine read sequence 29 or other read sequences that may be present.

The location of the secondary nucleic acid probes 30, 31, etc. may be determined by determining signaling entity 33. For example, if the signaling entity is fluorescent, then fluorescence microscopy can be used to determine the signaling entity. In some embodiments, imaging of a sample to determine the signaling entity may be used at relatively high resolutions, and in some cases, super-resolution imaging techniques (e.g., resolutions better than the wavelength of visible light or the diffraction limit of light) may be used. Examples of super-resolution imaging techniques include STORM, or other techniques as discussed herein. In some cases, e.g., with certain super-resolution imaging techniques such as STORM, more than one image of the sample may be acquired.

More than one type of secondary nucleic acid probe may be applied to a cell or other sample. For example, a first secondary nucleic acid probe may be applied that can recognize a first read sequence, then it or its attached signaling entity may be inactivated or removed, and a second secondary nucleic acid probe may be applied that can recognize a second read sequence. This process may be repeated multiple times, each with a different secondary nucleic acid probe, e.g., to determine the read sequences that were present in the various primary nucleic acid probes. Thus, primary nucleic acid probes within the sample can be determined on the basis of the binding pattern of secondary nucleic acid probes.

For example, a first location within the cell or other sample may exhibit binding of a first secondary probe and a third secondary probe, but not the binding of a second or a fourth secondary probe, while a second location may exhibit a different pattern of binding of various secondary probes. The primary nucleic acid probe that the secondary probes are able to bind to or hybridize with may be determined by considering the pattern of binding of various secondary probes. For instance, referring to FIG. 3A, if a first secondary probe is able to determine read sequence 27, a second secondary probe is able to determine read sequence 28, and a third secondary probe is able to determine read sequence 29, then primary nucleic acid probe 21 may be determined through the binding of the first and third secondary probes (but not the second secondary probe), while primary nucleic acid probe 22 may be determined through the binding of the first and second secondary probes (but not the third secondary probe).

Similarly, if it is known that first probe 21 contains target sequence 25 while second probe 22 contains target sequence 26, then nucleic acids 11 and 12 may also be determined within the sample, e.g., spatially, based on the binding pattern of the various secondary nucleic acid probes. In addition, it should be noted that due to the presence of more than one read sequence on the primary nucleic acid probes, even though first probe 21 and second probe 22 contain a common read sequence (read sequence 27), these probes may be distinguished in the sample due to the different binding patterns of the various secondary nucleic acid probes.

In certain embodiments, this pattern of binding or hybridization of the secondary nucleic acid probes may be converted into a “codeword.” In this example, for instance, the codewords are “101” and “110” for first probe 21 and second probe 22, respectively, where a value of 1 represents binding and a value of 0 represents no binding. The codewords may also have longer lengths in other embodiments; only three secondary probes are shown here, for clarity. A codeword can be directly related to a specific target nucleic acid sequence of the primary nucleic acid probe. Accordingly, different primary nucleic acid probes may match certain codewords, which can then be used to identify the different targets of the primary nucleic acid probes based on the binding patterns of the secondary probes, even if in some cases, there is overlap in the read sequences of different primary probes, e.g., as was shown in FIG. 3A. However, if no binding is evident (e.g., for nucleic acid 13), then the codeword would be “000” in this example.
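The conversion of binding patterns into codewords in the FIG. 3A example may be sketched, without limitation, as follows. The sketch assumes that the three secondary probes are applied in the order of read sequences 27, 28, and 29; the variable and function names are hypothetical.

```python
# Assumed readout order: the first, second and third secondary probes
# recognize read sequences 27, 28 and 29, respectively (as in FIG. 3A).
READOUT_ORDER = [27, 28, 29]

# Read sequences carried by each primary probe in the FIG. 3A example.
PRIMARY_PROBES = {21: {27, 29}, 22: {27, 28}}


def expected_codeword(read_sequences, readout_order=READOUT_ORDER):
    """Write 1 where a secondary probe binds a read sequence, else 0."""
    return "".join("1" if r in read_sequences else "0" for r in readout_order)


# Codebook mapping each expected codeword to its primary probe.
codebook = {expected_codeword(reads): probe
            for probe, reads in PRIMARY_PROBES.items()}


def decode(observed_codeword):
    """Return the matching primary probe number, or None if no match."""
    return codebook.get(observed_codeword)


print(codebook)
print(decode("101"), decode("110"), decode("000"))
```

Here “101” resolves to probe 21 and “110” to probe 22, while “000” (no binding, as for nucleic acid 13) matches no probe.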

The values in each codeword can also be assigned in different fashionsin some embodiments. For example, a value of 0 could represent bindingwhile a value of 1 represents no binding. Similarly, a value of 1 couldrepresent binding of a secondary nucleic acid probe with one type ofsignaling entity while a value of 0 could represent binding of asecondary nucleic acid probe with another type of distinguishablesignaling entity. These signaling entities could be distinguished, forexample, via different colors of fluorescence. In some cases, values incodewords need not be confined to 0 and 1. The values could also bedrawn from larger alphabets, such as ternary (e.g., 0, 1, and 2) orquaternary (e.g., 0, 1, 2, and 3) systems. Each different value could,for example, be represented by a different distinguishable signalingentity, including (in some cases) one value that may be represented bythe absence of signal.

The codewords for each target may be assigned sequentially, or may be assigned at random. For instance, referring to FIG. 3A, a first nucleic acid target may be assigned to 101, while a second nucleic acid target may be assigned to 110. In addition, in some embodiments, the codewords may be assigned using an error-detection system or an error-correcting system, such as a Hamming system, a Golay code, or an extended Hamming system (or a SECDED system, i.e., single error correction, double error detection). Generally speaking, such systems can be used to identify where errors have occurred, and in some cases, such systems can also be used to correct the errors and determine what the correct codeword should have been. For example, a codeword such as 001 may be detected as invalid and corrected using such a system to 101, e.g., if 001 is not previously assigned to a different target sequence. A variety of different error-correcting codes can be used, many of which have previously been developed for use within the computer industry; however, such error-correcting systems have not typically been used within biological systems. Additional examples of such error-correcting codes are discussed in more detail below.
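The 001-to-101 example above can be illustrated with a minimal nearest-codeword sketch in Python. This is simple Hamming-distance decoding over the two example codewords, not an implementation of a Hamming, Golay, or SECDED code; the function names and the `max_errors` parameter are assumptions made for illustration.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length codewords differ."""
    if len(a) != len(b):
        raise ValueError("codewords must have the same length")
    return sum(x != y for x, y in zip(a, b))

def correct(observed, valid_codewords, max_errors=1):
    """Return the unique valid codeword within max_errors of observed.

    Returns None when the observation is invalid and cannot be corrected
    unambiguously (no candidate, or more than one candidate).
    """
    if observed in valid_codewords:
        return observed
    candidates = [c for c in valid_codewords
                  if hamming_distance(observed, c) <= max_errors]
    return candidates[0] if len(candidates) == 1 else None

VALID = ["101", "110"]          # codewords assigned in the FIG. 3A example
print(correct("001", VALID))    # 101: a single flip away from 101 only
print(correct("100", VALID))    # None: one flip away from both codewords
```

Because the two example codewords differ in only two positions, some single errors (such as 100) can be detected but not corrected in this toy case; codes with larger minimum distances avoid this ambiguity.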

It should also be understood that all possible codewords in a code need not be used in some cases. For example, in some embodiments, codewords that are not used can serve as negative controls. Similarly, in some embodiments, some codewords can be left out because they are more prone to errors in measurement than other codewords. For example, in some implementations, reading a codeword with more values of ‘1’ might be more error-prone than reading a codeword with fewer values of ‘1.’

It should be understood that the above description is an example of one embodiment of the invention, and that primary and secondary nucleic acid probes are not necessary in all embodiments. For example, in some embodiments, a series of nucleic acid probes containing signaling entities are used to determine nucleic acids within a cell or other sample, without necessarily requiring secondary probes.

For example, turning now to FIG. 3B, nucleic acids 11, 12, and 13 areexposed to different rounds of probes 21, 22, 23, 24, etc. in thisexample. These probes may each contain a target sequence that canrecognize a nucleic acid (e.g., a sequence within nucleic acid 11 or12). These probes may each target the same nucleic acid, but differentregions of the nucleic acid. In addition, some or all of the probes maycontain one or more signaling entities, e.g., signaling entity 29 onprobe 21. For example, the signaling entity may be a fluorescent entityattached to the probe, or a certain sequence of nucleic acids that canbe determined in some fashion.

The first round of probes (e.g. probe 21 and probe 22) may be applied tothe cell or other sample. Probe 21 may be allowed to bind to nucleicacid 11 via target sequence 25. Such binding can be determined bydetermining signaling entity 29. For example, if the signaling entity isfluorescent, then fluorescence microscopy can be used to determine thesignaling entity, e.g., spatially within the cell or other sample. Insome but not all embodiments, imaging of a sample to determine thesignaling entity may be used at relatively high resolutions, and in somecases, super-resolution imaging techniques may be used. Other, differentprobes may be present as well; for instance, probe 22 containing targetsequence 26 may bind to nucleic acid 12, and be determined via signalingentity 29 within probe 22. These may occur, e.g., sequentially orsimultaneously. Optionally, probes 21 and 22 may also be removed orinactivated, e.g., between application of different rounds of probes.

Next, a second round of probes (e.g., probe 23) is applied to the sample. In this example, probe 23 is able to bind to nucleic acid 11 via a targeting region, although there is no probe in the second round that is able to bind to nucleic acid 12. Binding of the probes is allowed to occur as discussed above, and determination of binding may occur via signaling entities. These signaling entities may be the same as or different from those of the first round of probes. This process may be repeated any number of times with different probes. For example, as is shown in FIG. 3B, round 2 contains probes able to bind to nucleic acid 11, while round 3 contains probes able to bind to nucleic acid 12.

In certain embodiments, each round of binding or hybridization of nucleic acid probes may be converted into a “codeword.” In this example, using probes 21, 22, 23, and 24, the codewords 101 or 110 could be formed, where 1 represents binding and 0 represents no binding, and where the first position corresponds to the binding of probes 21 or 22, the second position corresponds to the binding of probe 23, and the third position corresponds to the binding of probe 24. A codeword of 000 would represent no binding, e.g., as shown with nucleic acid 13 in this example. A codeword can be directly related to a specific target nucleic acid sequence of the nucleic acid probes, by designing appropriate nucleic acid probes. Thus, for example, 110 may correspond to a first target nucleic acid 11 (e.g., the first and second round of nucleic acid probes containing probes able to target nucleic acid 11, and these probes may target the same or different regions of nucleic acid 11) while 101 may correspond to a second target nucleic acid 12 (e.g., the first and third round of nucleic acid probes containing probes able to target nucleic acid 12, and these probes may target the same or different regions of nucleic acid 12). In addition, it should be noted that each round of probes may contain the same or different signaling entities as other probes in the same round, and/or other probes in different rounds. For instance, in one set of embodiments, only one signaling entity is used in all of the rounds of probes.
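The round-by-round construction of codewords may be sketched as follows. The sketch assumes, consistent with the description of the rounds in FIG. 3B, that probes 21 and 23 target nucleic acid 11 and that probes 22 and 24 target nucleic acid 12; the data structure `ROUNDS` and the function name are hypothetical.

```python
# Each round maps the probes applied in that round to the nucleic acid
# they are assumed to target (an illustrative reading of FIG. 3B).
ROUNDS = [
    {"probe 21": "nucleic acid 11", "probe 22": "nucleic acid 12"},  # round 1
    {"probe 23": "nucleic acid 11"},                                 # round 2
    {"probe 24": "nucleic acid 12"},                                 # round 3
]

def codeword_for(target, rounds):
    """Build a codeword with one position per round.

    A position is '1' if any probe in that round targets the nucleic acid,
    and '0' otherwise.
    """
    return "".join("1" if target in r.values() else "0" for r in rounds)

for target in ("nucleic acid 11", "nucleic acid 12", "nucleic acid 13"):
    print(target, codeword_for(target, ROUNDS))
```

Under these assumptions, nucleic acid 11 yields 110, nucleic acid 12 yields 101, and nucleic acid 13, which no probe targets, yields 000.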

Similar to the above, the codewords for each target may be assigned sequentially, or may be assigned at random. The codewords may be assigned within a code space in some embodiments using an error-detection or an error-correcting system, such as a Hamming system, a Golay code, or an extended Hamming system or a SECDED system (single error correction, double error detection). Generally speaking, such error-correction systems can be used to identify where errors have occurred, and in some cases, such systems can also be used to correct the errors and determine what the correct codeword should have been.

Similar to the above, the values at each position in the codeword can be arbitrarily assigned in certain embodiments to binding or non-binding of probes that contain more than one distinguishable signaling entity.

In some cases, the nucleic acid probes may be formed into “pools” or groups of nucleic acids that share a common feature. For example, probes to all targets with codewords that contain a 1 in the first position, e.g., 110 and 101 but not 011, may comprise one pool, while probes to all targets that contain a 1 in the second position, e.g., 110 and 011 but not 101, may comprise another pool. See also FIG. 1C. In some cases, a nucleic acid probe may be a member of more than one group or pool. Members of a nucleic acid pool may also contain features in addition to target sequences, read sequences, and/or signaling entities that allow them to be distinguished from other groups. These features may be short nucleic acid sequences that are used for the amplification, production, or separation of these sequences. The nucleic acid probes of each group may be applied to a sample, e.g., sequentially, as discussed herein.
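The grouping of probes into pools by codeword position can be sketched as below. The target names and the function `build_pools` are hypothetical; the three codewords are the ones used in the example above.

```python
def build_pools(codebook):
    """Group targets into one pool per codeword position.

    Pool i holds every target whose codeword has a '1' at position i, so a
    target appears in as many pools as its codeword has '1' values.
    """
    length = len(next(iter(codebook.values())))
    pools = [[] for _ in range(length)]
    for target, word in codebook.items():
        for i, value in enumerate(word):
            if value == "1":
                pools[i].append(target)
    return pools

# Hypothetical assignment of the example codewords to three targets.
CODEBOOK = {"target A": "110", "target B": "101", "target C": "011"}

pools = build_pools(CODEBOOK)
for i, pool in enumerate(pools, start=1):
    print(f"pool {i}: {pool}")
```

Here the first pool contains the targets assigned 110 and 101 but not 011, and each target belongs to two pools.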

Thus, in some aspects, the present invention is generally directed tosystems and methods for determining nucleic acids within a cell or othersample. The sample may include a cell culture, a suspension of cells, abiological tissue, a biopsy, an organism, or the like. The sample mayalso be cell-free but nevertheless contain nucleic acids. If the samplecontains a cell, the cell may be a human cell, or any other suitablecell, e.g., a mammalian cell, a fish cell, an insect cell, a plant cell,or the like. More than one cell may be present in some cases.

The nucleic acids to be determined may be, for example, DNA, RNA, orother nucleic acids that are present within a cell (or other sample).The nucleic acids may be endogenous to the cell, or added to the cell.For instance, the nucleic acid may be viral, or artificially created. Insome cases, the nucleic acid to be determined may be expressed by thecell. The nucleic acid is RNA in some embodiments. The RNA may be codingand/or non-coding RNA. Non-limiting examples of RNA that may be studiedwithin the cell include mRNA, siRNA, rRNA, miRNA, tRNA, lncRNA, snoRNAs,snRNAs, exRNAs, piRNAs, or the like.

In some cases, a significant portion of the nucleic acid within the cellmay be studied. For instance, in some cases, enough of the RNA presentwithin a cell may be determined so as to produce a partial or completetranscriptome of the cell. In some cases, at least 4 types of mRNAs aredetermined within a cell, and in some cases, at least 3, at least 4, atleast 7, at least 8, at least 12, at least 14, at least 15, at least 16,at least 22, at least 30, at least 31, at least 32, at least 50, atleast 63, at least 64, at least 72, at least 75, at least 100, at least127, at least 128, at least 140, at least 255, at least 256, at least500, at least 1,000, at least 1,500, at least 2,000, at least 2,500, atleast 3,000, at least 4,000, at least 5,000, at least 7,500, at least10,000, at least 12,000, at least 15,000, at least 20,000, at least25,000, at least 30,000, at least 40,000, at least 50,000, at least75,000, or at least 100,000 types of mRNAs may be determined within acell.

In some cases, the transcriptome of a cell may be determined. It shouldbe understood that the transcriptome generally encompasses all RNAmolecules produced within a cell, not just mRNA. Thus, for instance, thetranscriptome may also include rRNA, tRNA, siRNA, etc. In someembodiments, at least 5%, at least 10%, at least 15%, at least 20%, atleast 25%, at least 30%, at least 40%, at least 50%, at least 60%, atleast 70%, at least 80%, at least 90%, or 100% of the transcriptome of acell may be determined.

The determination of one or more nucleic acids within the cell or othersample may be qualitative and/or quantitative. In addition, thedetermination may also be spatial, e.g., the position of the nucleicacid within the cell or other sample may be determined in two or threedimensions. In some embodiments, the positions, number, and/orconcentrations of nucleic acids within the cell (or other sample) may bedetermined.

In some cases, a significant portion of the genome of a cell may bedetermined. The determined genomic segments may be continuous orinterspersed on the genome. For example, in some cases, at least 4genomic segments are determined within a cell, and in some cases, atleast 3, at least 4, at least 7, at least 8, at least 12, at least 14,at least 15, at least 16, at least 22, at least 30, at least 31, atleast 32, at least 50, at least 63, at least 64, at least 72, at least75, at least 100, at least 127, at least 128, at least 140, at least255, at least 256, at least 500, at least 1,000, at least 1,500, atleast 2,000, at least 2,500, at least 3,000, at least 4,000, at least5,000, at least 7,500, at least 10,000, at least 12,000, at least15,000, at least 20,000, at least 25,000, at least 30,000, at least40,000, at least 50,000, at least 75,000, or at least 100,000 genomicsegments may be determined within a cell.

In some cases, the entire genome of a cell may be determined. It should be understood that the genome generally encompasses all DNA molecules produced within a cell, not just chromosomal DNA. Thus, for instance, the genome may also include, in some cases, mitochondrial DNA, chloroplast DNA, plasmid DNA, etc. In some embodiments, at least about 5%, at least about 10%, at least about 15%, at least about 20%, at least about 25%, at least about 30%, at least about 40%, at least about 50%, at least about 60%, at least about 70%, at least about 80%, at least about 90%, or 100% of the genome of a cell may be determined.

As discussed herein, a variety of nucleic acid probes may be used todetermine one or more nucleic acids within a cell or other sample. Theprobes may comprise nucleic acids (or entities that can hybridize to anucleic acid, e.g., specifically) such as DNA, RNA, LNA (locked nucleicacids), PNA (peptide nucleic acids), or combinations thereof. In somecases, additional components may also be present within the nucleic acidprobes, e.g., as discussed below. Any suitable method may be used tointroduce nucleic acid probes into a cell.

For example, in some embodiments, the cell is fixed prior to introducingthe nucleic acid probes, e.g., to preserve the positions of the nucleicacids within the cell. Techniques for fixing cells are known to those ofordinary skill in the art. As non-limiting examples, a cell may be fixedusing chemicals such as formaldehyde, paraformaldehyde, glutaraldehyde,ethanol, methanol, acetone, acetic acid, or the like. In one embodiment,a cell may be fixed using Hepes-glutamic acid buffer-mediated organicsolvent (HOPE).

The nucleic acid probes may be introduced into the cell (or othersample) using any suitable method. In some cases, the cell may besufficiently permeabilized such that the nucleic acid probes may beintroduced into the cell by flowing a fluid containing the nucleic acidprobes around the cells. In some cases, the cells may be sufficientlypermeabilized as part of a fixation process; in other embodiments, cellsmay be permeabilized by exposure to certain chemicals such as ethanol,methanol, Triton, or the like. In addition, in some embodiments,techniques such as electroporation or microinjection may be used tointroduce nucleic acid probes into a cell or other sample.

Certain aspects of the present invention are generally directed to nucleic acid probes that are introduced into a cell (or other sample). The probes may comprise any of a variety of entities that can hybridize to a nucleic acid, typically by Watson-Crick base pairing, such as DNA, RNA, LNA, PNA, etc., depending on the application. The nucleic acid probe typically contains a target sequence that is able to bind to at least a portion of a target nucleic acid, in some cases specifically. When introduced into a cell or other system, the target sequence may be able to bind to a specific target nucleic acid (e.g., an mRNA, or other nucleic acids as discussed herein). In some cases, the nucleic acid probes may be determined using signaling entities (e.g., as discussed below), and/or by using secondary nucleic acid probes able to bind to the nucleic acid probes (i.e., to primary nucleic acid probes). The determination of such nucleic acid probes is discussed in detail below.

In some cases, more than one type of (primary) nucleic acid probe may beapplied to a sample, e.g., simultaneously. For example, there may be atleast 2, at least 5, at least 10, at least 25, at least 50, at least 75,at least 100, at least 300, at least 1,000, at least 3,000, at least10,000, or at least 30,000 distinguishable nucleic acid probes that areapplied to a sample, e.g., simultaneously or sequentially.

The target sequence may be positioned anywhere within the nucleic acid probe (or primary nucleic acid probe or encoding nucleic acid probe). The target sequence may contain a region that is substantially complementary to a portion of a target nucleic acid. In some cases, the portions may be at least 50%, at least 60%, at least 70%, at least 75%, at least 80%, at least 85%, at least 90%, at least 92%, at least 94%, at least 95%, at least 96%, at least 97%, at least 98%, at least 99%, or 100% complementary. In some cases, the target sequence may be at least 5, at least 10, at least 15, at least 20, at least 25, at least 30, at least 35, at least 40, at least 50, at least 60, at least 65, at least 75, at least 100, at least 125, at least 150, at least 175, at least 200, at least 250, at least 300, at least 350, at least 400, or at least 450 nucleotides in length. In some cases, the target sequence may be no more than 500, no more than 450, no more than 400, no more than 350, no more than 300, no more than 250, no more than 200, no more than 175, no more than 150, no more than 125, no more than 100, no more than 75, no more than 65, no more than 60, no more than 55, no more than 50, no more than 45, no more than 40, no more than 35, no more than 30, no more than 20, or no more than 10 nucleotides in length. Combinations of any of these are also possible, e.g., the target sequence may have a length of between 10 and 30 nucleotides, between 20 and 40 nucleotides, between 5 and 50 nucleotides, between 10 and 200 nucleotides, between 25 and 35 nucleotides, or between 10 and 300 nucleotides, etc. Typically, complementarity is determined on the basis of Watson-Crick nucleotide base pairing.
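One simple way to illustrate a percent-complementarity figure of the kind recited above is sketched below. The sketch assumes an ungapped, equal-length comparison over the DNA alphabet (A, C, G, T) with both strands written 5' to 3'; the function name and the example sequences are hypothetical, and other definitions of percent complementarity are possible.

```python
# Watson-Crick pairs for DNA (an RNA target would pair A with U instead).
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}

def percent_complementary(probe, target_region):
    """Percentage of probe positions that base-pair with the target region.

    Both sequences are written 5'->3', so the probe is compared against the
    reversed target region (antiparallel pairing). No gaps are considered.
    """
    if len(probe) != len(target_region):
        raise ValueError("sequences must have the same length")
    target_reversed = target_region[::-1]
    paired = sum(PAIRS.get(p) == t for p, t in zip(probe, target_reversed))
    return 100.0 * paired / len(probe)

target = "AAGGCTTC"                                # hypothetical target region
print(percent_complementary("GAAGCCTT", target))   # perfect reverse complement
print(percent_complementary("GAAGCCTA", target))   # one mismatched position
```

With these inputs, the first probe pairs at all eight positions and the second at seven of eight.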

The target sequence of a (primary) nucleic acid probe may be determinedwith reference to a target nucleic acid suspected of being presentwithin a cell or other sample. For example, a target nucleic acid to aprotein may be determined using the protein's sequence, by determiningthe nucleic acids that are expressed to form the protein. In some cases,only a portion of the nucleic acids encoding the protein are used, e.g.,having the lengths as discussed above. In addition, in some cases, morethan one target sequence that can be used to identify a particulartarget may be used. For instance, multiple probes can be used,sequentially and/or simultaneously, that can bind to or hybridize todifferent regions of the same target. Hybridization typically refers toan annealing process by which complementary single-stranded nucleicacids associate through Watson-Crick nucleotide base pairing (e.g.,hydrogen bonding, guanine-cytosine and adenine-thymine) to formdouble-stranded nucleic acid.

In some embodiments, a nucleic acid probe, such as a primary nucleic acid probe, may also comprise one or more “read” sequences. However, it should be understood that read sequences are not necessary in all cases. In some embodiments, the nucleic acid probe may comprise 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 or more, 20 or more, 32 or more, 40 or more, 50 or more, 64 or more, 75 or more, 100 or more, or 128 or more read sequences. The read sequences may be positioned anywhere within the nucleic acid probe. If more than one read sequence is present, the read sequences may be positioned next to each other, and/or interspersed with other sequences.

The read sequences, if present, may be of any length. If more than one read sequence is used, the read sequences may independently have the same or different lengths. For instance, the read sequence may be at least 5, at least 10, at least 15, at least 20, at least 25, at least 30, at least 35, at least 40, at least 50, at least 60, at least 65, at least 75, at least 100, at least 125, at least 150, at least 175, at least 200, at least 250, at least 300, at least 350, at least 400, or at least 450 nucleotides in length. In some cases, the read sequence may be no more than 500, no more than 450, no more than 400, no more than 350, no more than 300, no more than 250, no more than 200, no more than 175, no more than 150, no more than 125, no more than 100, no more than 75, no more than 65, no more than 60, no more than 55, no more than 50, no more than 45, no more than 40, no more than 35, no more than 30, no more than 20, or no more than 10 nucleotides in length. Combinations of any of these are also possible, e.g., the read sequence may have a length of between 10 and 30 nucleotides, between 20 and 40 nucleotides, between 5 and 50 nucleotides, between 10 and 200 nucleotides, between 25 and 35 nucleotides, or between 10 and 300 nucleotides, etc.

The read sequence may be arbitrary or random in some embodiments. Incertain cases, the read sequences are chosen so as to reduce or minimizehomology with other components of the cell or other sample, e.g., suchthat the read sequences do not themselves bind to or hybridize withother nucleic acids suspected of being within the cell or other sample.In some cases, the homology may be less than 10%, less than 8%, lessthan 7%, less than 6%, less than 5%, less than 4%, less than 3%, lessthan 2%, or less than 1%. In some cases, there may be a homology of lessthan 20 basepairs, less than 18 basepairs, less than 15 basepairs, lessthan 14 basepairs, less than 13 basepairs, less than 12 basepairs, lessthan 11 basepairs, or less than 10 basepairs. In some cases, thebasepairs are sequential.

In one set of embodiments, a population of nucleic acid probes may contain a certain number of read sequences, which may be less than the number of targets of the nucleic acid probes in some cases. Those of ordinary skill in the art will be aware that if there is one signaling entity and n read sequences, then in general $2^{n}-1$ different nucleic acid targets may be uniquely identified. However, not all possible combinations need be used. For instance, a population of nucleic acid probes may target 12 different nucleic acid sequences, yet contain no more than 8 read sequences. As another example, a population of nucleic acids may target 140 different nucleic acid species, yet contain no more than 16 read sequences. Different nucleic acid sequence targets may be separately identified by using different combinations of read sequences within each probe. For instance, each probe may contain 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, etc. or more read sequences. In some cases, a population of nucleic acid probes may each contain the same number of read sequences, although in other cases, there may be different numbers of read sequences present on the various probes.
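The counting argument above (one signaling entity, n read sequences, all-zero pattern excluded) can be checked with a short sketch. The function names are hypothetical and are used only to illustrate the arithmetic.

```python
def max_targets(n_read_sequences):
    """Number of distinct non-empty binding patterns over n read sequences."""
    return 2 ** n_read_sequences - 1

def min_read_sequences(n_targets):
    """Smallest n for which max_targets(n) is at least n_targets."""
    n = 0
    while max_targets(n) < n_targets:
        n += 1
    return n

# The two examples in the text stay well within the available patterns.
print(max_targets(8), min_read_sequences(12))     # 255 patterns; 4 would suffice
print(max_targets(16), min_read_sequences(140))   # 65535 patterns; 8 would suffice
```

Using more read sequences than the minimum leaves unused patterns, which is what permits codeword assignments with error detection or correction.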

As a non-limiting example, a first nucleic acid probe may contain afirst target sequence, a first read sequence, and a second readsequence, while a second, different nucleic acid probe may contain asecond target sequence, the same first read sequence, but a third readsequence instead of the second read sequence. Such probes may thereby bedistinguished by determining the various read sequences present orassociated with a given probe or location, as discussed herein.

In addition, the nucleic acid probes (and their corresponding, complementary sites on the encoding probes), in certain embodiments, may be made using only 2 or only 3 of the 4 bases, such as leaving out all the “G”s or leaving out all of the “C”s within the probe. Sequences lacking either “G”s or “C”s may form very little secondary structure in certain embodiments, and can contribute to more uniform, faster hybridization.

In some embodiments, the nucleic acid probe may contain a signaling entity. It should be understood that signaling entities are not required in all cases, however; for instance, the nucleic acid probe may be determined using secondary nucleic acid probes in some embodiments, as is discussed in additional detail below. Examples of signaling entities that can be used are also discussed in more detail below.

Other components may also be present within a nucleic acid probe. For example, in one set of embodiments, one or more primer sequences may be present, e.g., to allow for enzymatic amplification of probes. Those of ordinary skill in the art will be aware of primer sequences suitable for applications such as amplification (e.g., using PCR or other suitable techniques). Many such primer sequences are available commercially. Other examples of sequences that may be present within a primary nucleic acid probe include, but are not limited to, promoter sequences, operons, identification sequences, nonsense sequences, or the like.

Typically, a primer is a single-stranded or partially double-stranded nucleic acid (e.g., DNA) that serves as a starting point for nucleic acid synthesis, allowing polymerase enzymes such as nucleic acid polymerase to extend the primer and replicate the complementary strand. A primer is (e.g., is designed to be) complementary to, and able to hybridize to, a target nucleic acid. In some embodiments, a primer is a synthetic primer. In some embodiments, a primer is a non-naturally-occurring primer. A primer typically has a length of 10 to 50 nucleotides. For example, a primer may have a length of 10 to 40, 10 to 30, 10 to 20, 25 to 50, 15 to 40, 15 to 30, 20 to 50, 20 to 40, or 20 to 30 nucleotides. In some embodiments, a primer has a length of 18 to 24 nucleotides.

In addition, the components of the nucleic acid probe may be arranged inany suitable order. For instance, in one embodiment, the components maybe arranged in a nucleic acid probe as: primer—read sequences—targetingsequence—read sequences—reverse primer. The “read sequences” in thisstructure may each contain any number (including 0) of read sequences,so long as at least one read sequence is present in the probe.Non-limiting example structures include primer—targeting sequence—readsequences—reverse primer, primer—read sequences—targetingsequence—reverse primer, targeting sequence—primer—targetingsequence—read sequences—reverse primer, targeting sequence—primer—readsequences—targeting sequence—reverse primer, primer—target sequence—readsequences—targeting sequence—reverse primer, targetingsequence—primer—read sequence—reverse primer, targeting sequence—readsequence—primer, read sequence—targeting sequence—primer, readsequence—primer—targeting sequence—reverse primer, etc. In addition, thereverse primer is optional in some embodiments, including in all of theabove-described examples.

After introduction of the nucleic acid probes into a cell or other sample, the nucleic acid probes may be directly determined by determining signaling entities (if present), and/or the nucleic acid probes may be determined by using one or more secondary nucleic acid probes, in accordance with certain aspects of the invention. As mentioned, in some cases, the determination may be spatial, e.g., in two or three dimensions. In addition, in some cases, the determination may be quantitative, e.g., the amount or concentration of a primary nucleic acid probe (and of a target nucleic acid) may be determined. Additionally, the secondary probes may comprise any of a variety of entities able to hybridize to a nucleic acid, e.g., DNA, RNA, LNA, and/or PNA, etc., depending on the application. Signaling entities are discussed in more detail below.

A secondary nucleic acid probe may contain a recognition sequence able to bind to or hybridize with a read sequence of a primary nucleic acid probe. In some cases, the binding is specific, or the binding may be such that a recognition sequence preferentially binds to or hybridizes with only one of the read sequences that are present. The secondary nucleic acid probe may also contain one or more signaling entities. If more than one secondary nucleic acid probe is used, the signaling entities may be the same or different.

The recognition sequences may be of any length. If more than one recognition sequence is used, the recognition sequences may independently have the same or different lengths. For instance, the recognition sequence may be at least 5, at least 10, at least 15, at least 20, at least 25, at least 30, at least 35, at least 40, or at least 50 nucleotides in length. In some cases, the recognition sequence may be no more than 75, no more than 65, no more than 60, no more than 55, no more than 50, no more than 45, no more than 40, no more than 35, no more than 30, no more than 20, or no more than 10 nucleotides in length. Combinations of any of these are also possible, e.g., the recognition sequence may have a length of between 10 and 30, between 20 and 40, or between 25 and 35 nucleotides, etc. In one embodiment, the recognition sequence is of the same length as the read sequence. In addition, in some cases, the recognition sequence may be at least 50%, at least 60%, at least 70%, at least 75%, at least 80%, at least 85%, at least 90%, at least 92%, at least 94%, at least 95%, at least 96%, at least 97%, at least 98%, at least 99%, or 100% complementary to a read sequence of the primary nucleic acid probe.

As mentioned, in some cases, the secondary nucleic acid probe may comprise one or more signaling entities. Examples of signaling entities are discussed in more detail below.

As discussed, in certain aspects of the invention, nucleic acid probesare used that contain various “read sequences.” For example, apopulation of primary nucleic acid probes may contain certain “readsequences” which can bind certain of the secondary nucleic acid probes,and the locations of the primary nucleic acid probes are determinedwithin the sample using secondary nucleic acid probes, e.g., whichcomprise a signaling entity. As mentioned, in some cases, a populationof read sequences may be combined in various combinations to producedifferent nucleic acid probes, e.g., such that a relatively small numberof read sequences may be used to produce a relatively large number ofdifferent nucleic acid probes.

Thus, in some cases, a population of primary nucleic acid probes (orother nucleic acid probes) may each contain a certain number of readsequences, some of which are shared between different primary nucleicacid probes such that the total population of primary nucleic acidprobes may contain a certain number of read sequences. A population ofnucleic acid probes may have any suitable number of read sequences. Forexample, a population of primary nucleic acid probes may have 1, 2, 3,4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20 etc. readsequences. More than 20 are also possible in some embodiments. Inaddition, in some cases, a population of nucleic acid probes may, intotal, have 1 or more, 2 or more, 3 or more, 4 or more, 5 or more, 6 ormore, 7 or more, 8 or more, 9 or more, 10 or more, 11 or more, 12 ormore, 13 or more, 14 or more, 15 or more, 16 or more, 20 or more, 24 ormore, 32 or more, 40 or more, 50 or more, 60 or more, 64 or more, 100 ormore, 128 or more, etc. of possible read sequences present, althoughsome or all of the probes may each contain more than one read sequence,as discussed herein. In addition, in some embodiments, the population ofnucleic acid probes may have no more than 100, no more than 80, no morethan 64, no more than 60, no more than 50, no more than 40, no more than32, no more than 24, no more than 20, no more than 16, no more than 15,no more than 14, no more than 13, no more than 12, no more than 11, nomore than 10, no more than 9, no more than 8, no more than 7, no morethan 6, no more than 5, no more than 4, no more than 3, or no more thantwo read sequences present. Combinations of any of these are alsopossible, e.g., a population of nucleic acid probes may comprise between10 and 15 read sequences in total.

As a non-limiting example of an approach to combinatorially producing arelatively large number of nucleic acid probes from a relatively smallnumber of read sequences, in a population of 6 different types ofnucleic acid probes, each comprising one or more read sequences, thetotal number of read sequences within the population may be no greaterthan 4. It should be understood that although 4 read sequences are usedin this example for ease of explanation, in other embodiments, largernumbers of nucleic acid probes may be realized, for example, using 5, 8,10, 16, 32, etc. or more read sequences, or any other suitable number ofread sequences described herein, depending on the application. Referringnow to FIG. 4A, if each of the primary nucleic acid probes contains twodifferent read sequences, then by using 4 such read sequences (A, B, C,and D), up to 6 probes may be separately identified. It should be notedthat in this example, the ordering of read sequences on a nucleic acidprobe is not essential, i.e., “AB” and “BA” may be treated as beingsynonymous (although in other embodiments, the ordering of readsequences may be essential and “AB” and “BA” may not necessarily besynonymous). Similarly, if 5 read sequences are used (A, B, C, D, and E)in the population of primary nucleic acid probes, up to 10 probes may beseparately identified, as is shown in FIG. 4B. For example, one ofordinary skill in the art would understand that, for k read sequences ina population with n read sequences on each probe, up to

$\begin{pmatrix}n \\k\end{pmatrix}$

different probes may be produced, assuming that the ordering of read sequences is not essential; because not all of the probes need to have the same number of read sequences and not all combinations of read sequences need to be used in every embodiment, either more or fewer than this number of different probes may also be used in certain embodiments. In addition, it should also be understood that the number of read sequences on each probe need not be identical in some embodiments. For example, some probes may contain 2 read sequences while other probes may contain 3 read sequences.
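The counting argument in this example can be checked by direct enumeration. The following is a minimal sketch in Python, assuming unordered read sequences and the same number of read sequences on every probe; the function name `enumerate_probes` and the single-letter labels are illustrative only.

```python
from itertools import combinations
from math import comb


def enumerate_probes(read_sequences, per_probe):
    """List every unordered choice of `per_probe` distinct read sequences."""
    return ["".join(c) for c in combinations(read_sequences, per_probe)]


four = enumerate_probes("ABCD", 2)   # pairs drawn from A, B, C, D
five = enumerate_probes("ABCDE", 2)  # pairs drawn from A through E

print(four)
print(len(four), len(five))

# The enumeration agrees with the binomial coefficient.
assert len(four) == comb(4, 2)
assert len(five) == comb(5, 2)
```

Because `combinations` never emits both “AB” and “BA”, the sketch matches the case in which ordering is not essential; an embodiment in which ordering matters would instead use permutations.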

In some aspects, the read sequences and/or the pattern of binding of nucleic acid probes within a sample may be used to define an error-detecting and/or an error-correcting code, for example, to reduce or prevent misidentification or errors of the nucleic acids, e.g., as was discussed with reference to FIG. 3. Thus, for example, if binding is indicated (e.g., as determined using a signaling entity), then the location may be identified with a “1”; conversely, if no binding is indicated, then the location may be identified with a “0” (or vice versa, in some cases). Multiple rounds of binding determinations, e.g., using different nucleic acid probes, can then be used to create a “codeword,” e.g., for that spatial location. In some embodiments, the codeword may be subjected to error detection and/or correction. For instance, the codewords may be organized such that, if no match is found for a given set of read sequences or binding pattern of nucleic acid probes, then the reading may be identified as an error, and optionally, error correction may be applied to determine the correct target for the nucleic acid probes. In some cases, the codewords may have fewer “letters” or positions than the total number of nucleic acids encoded by the codewords, e.g., where each codeword encodes a different nucleic acid.

Such error-detecting and/or error-correcting codes may take a variety of forms. A variety of such codes, such as Golay codes or Hamming codes, have previously been developed in other contexts such as the telecommunications industry. In one set of embodiments, the read sequences or binding patterns of the nucleic acid probes are assigned such that not every possible combination is assigned.

For example, if 4 read sequences are possible and a primary nucleic acid probe contains 2 read sequences, then up to 6 primary nucleic acid probes could be identified; but the number of primary nucleic acid probes used may be less than 6. Similarly, for n read sequences in a population with k read sequences on each primary nucleic acid probe,

$\begin{pmatrix}n \\k\end{pmatrix}$

different probes may be produced, but the number of primary nucleic acid probes that are used may be any number more or less than

$\begin{pmatrix}n \\k\end{pmatrix}.$

In addition, these may be randomly assigned, or assigned in specific ways to increase the ability to detect and/or correct errors.

As another example, if multiple rounds of nucleic acid probes are used, the number of rounds may be arbitrarily chosen. If in each round, each target can give two possible outcomes, such as being detected or not being detected, up to $2^{n}$ different targets may be possible for n rounds of probes, but the number of nucleic acid targets that are actually used may be any number less than $2^{n}$. As a further example, if in each round, each target can give more than two possible outcomes, such as being detected in different color channels, more than $2^{n}$ (e.g., $3^{n}$, $4^{n}$, . . . ) different targets may be possible for n rounds of probes. In some cases, the number of nucleic acid targets that are actually used may be any number less than this number. In addition, these may be randomly assigned, or assigned in specific ways to increase the ability to detect and/or correct errors.
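The growth in the number of possible readout patterns with the number of rounds can be illustrated with a short Python sketch. The function name `possible_codewords` and the use of digit characters to label outcomes are assumptions made for illustration.

```python
from itertools import product


def possible_codewords(n_rounds, outcomes_per_round=2):
    """Enumerate every readout pattern over `n_rounds` rounds of probes.

    Each round contributes one symbol: "0" for not detected, and "1", "2", ...
    for detection in the first, second, ... color channel (illustrative labels).
    """
    symbols = "0123456789"[:outcomes_per_round]
    return ["".join(p) for p in product(symbols, repeat=n_rounds)]


binary = possible_codewords(3)       # two outcomes per round
ternary = possible_codewords(3, 3)   # three outcomes per round
print(len(binary), len(ternary))
```

In keeping with the passage above, only a subset of these patterns would typically be assigned to targets, which is what leaves room for error detection and correction.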

For example, in one set of embodiments, the codewords or nucleic acid probes may be assigned within a code space such that the assignments are separated by a Hamming distance, which measures the number of incorrect “reads” in a given pattern that cause the nucleic acid probe to be misinterpreted as a different valid nucleic acid probe. In certain cases, the Hamming distance may be at least 2, at least 3, at least 4, at least 5, at least 6, or the like. In addition, in one set of embodiments, the assignments may be formed as a Hamming code, for instance, a Hamming(7, 4) code, a Hamming(15, 11) code, a Hamming(31, 26) code, a Hamming(63, 57) code, a Hamming(127, 120) code, etc. In another set of embodiments, the assignments may form a SECDED code, e.g., a SECDED(8,4) code, a SECDED(16,4) code, a SECDED(16, 11) code, a SECDED(22, 16) code, a SECDED(39, 32) code, a SECDED(72, 64) code, etc. In yet another set of embodiments, the assignments may form an extended binary Golay code, a perfect binary Golay code, or a ternary Golay code. In another set of embodiments, the assignments may represent a subset of the possible values taken from any of the codes described above.

For example, a code with the same error-correcting properties as the SECDED code may be formed by using only binary words that contain a fixed number of ‘1’ bits, such as 4, to encode the targets. In another set of embodiments, the assignments may represent a subset of the possible values taken from codes described above for the purpose of addressing asymmetric readout errors. For example, in some cases, a code in which the number of ‘1’ bits is fixed for all used binary words may eliminate the biased measurement of words with different numbers of ‘1’s when the rates at which ‘0’ bits are measured as ‘1’s and ‘1’ bits are measured as ‘0’s are different.
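One way such a fixed-weight subset might be generated is sketched below in Python. The sketch assumes a 16-bit extended Hamming (SECDED) code described by a standard parity rule: a word is valid when it has an even number of ‘1’ bits and the XOR of the 4-bit indices of its ‘1’ positions is zero. This construction and the function names are assumptions chosen for illustration, not a description of any particular embodiment.

```python
from functools import reduce
from itertools import combinations
from operator import xor


def fixed_weight_codewords(length=16, weight=4):
    """Return all words with `weight` ones whose one-positions XOR to zero.

    For an even `weight`, these are the words of that weight in the
    extended Hamming code of the given length (length a power of two).
    """
    words = []
    for ones in combinations(range(length), weight):
        if reduce(xor, ones) == 0:
            words.append("".join("1" if i in ones else "0" for i in range(length)))
    return words


def min_hamming_distance(words):
    """Smallest number of differing positions over all pairs of words."""
    return min(sum(a != b for a, b in zip(u, v)) for u, v in combinations(words, 2))


codebook = fixed_weight_codewords()
print(len(codebook), min_hamming_distance(codebook))  # 140 4
```

Every word in this codebook has exactly four ‘1’ bits, so each word is exposed to the same mix of 1-to-0 and 0-to-1 readout errors, while the minimum Hamming distance of 4 is retained from the parent code.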

Accordingly, in some embodiments, once the codeword is determined (e.g., as discussed herein), the codeword may be compared to the known nucleic acid codewords. If a match is found, then the nucleic acid target can be identified or determined. If no match is found, then an error in the reading of the codeword may be identified. In some cases, error correction can also be applied to determine the correct codeword, thus yielding the correct identity of the nucleic acid target. In some cases, the codewords may be selected such that, assuming that there is only one error present, only one possible correct codeword is available, and thus, only one correct identity of the nucleic acid target is possible. In some cases, this may also be generalized to larger codeword spacings or Hamming distances; for instance, the codewords may be selected such that if two, three, or four errors are present (or more in some cases), only one possible correct codeword is available, and thus, only one correct identity of the nucleic acid target is possible.
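The comparison and correction steps described above can be sketched as a nearest-codeword lookup. The following Python example is a minimal illustration under stated assumptions: the three 7-bit codewords and the target names are hypothetical, and the brute-force search stands in for whatever decoding procedure a given embodiment may use.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))


def decode(measured, codebook, max_errors=1):
    """Return (target, n_errors), or (None, None) if no unique match exists."""
    if measured in codebook:
        return codebook[measured], 0
    for n_errors in range(1, max_errors + 1):
        hits = [target for word, target in codebook.items()
                if hamming_distance(measured, word) == n_errors]
        if len(hits) == 1:
            return hits[0], n_errors   # unique correction found
        if len(hits) > 1:
            return None, None          # ambiguous, so report an error only
    return None, None


# Hypothetical codebook; every pair of codewords differs in 4 positions.
CODEBOOK = {
    "1110000": "target_A",
    "1001100": "target_B",
    "0101010": "target_C",
}

print(decode("1110000", CODEBOOK))  # exact match
print(decode("1010000", CODEBOOK))  # one bit flipped relative to target_A
print(decode("0000000", CODEBOOK))  # too far from every codeword
```

Raising `max_errors` corresponds to the generalization to larger Hamming distances mentioned above, and is only meaningful when the codebook spacing is large enough for the correction to remain unique.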

The error-correcting code may be a binary error-correcting code, or it may be based on other numbering systems, e.g., ternary or quaternary error-correcting codes. For instance, in one set of embodiments, more than one type of signaling entity may be used and assigned to different numbers within the error-correcting code. Thus, as a non-limiting example, a first signaling entity (or more than one signaling entity, in some cases) may be assigned as “1” and a second signaling entity (or more than one signaling entity, in some cases) may be assigned as “2” (with “0” indicating no signaling entity present), and the codewords distributed to define a ternary error-correcting code. Similarly, a third signaling entity may additionally be assigned as “3” to make a quaternary error-correcting code, etc.

As discussed above, in certain aspects, signaling entities are determined, e.g., to determine nucleic acid probes and/or to create codewords. In some cases, signaling entities within a sample may be determined, e.g., spatially, using a variety of techniques. In some embodiments, the signaling entities may be fluorescent, and techniques for determining fluorescence within a sample, such as fluorescence microscopy or confocal microscopy, may be used to spatially identify the positions of signaling entities within a cell. In some cases, the positions of entities within the sample may be determined in two or even three dimensions. In addition, in some embodiments, more than one signaling entity may be determined at a time (e.g., signaling entities with different colors or emissions), and/or sequentially.

In addition, in some embodiments, a confidence level for the identified nucleic acid target may be determined. For example, the confidence level may be determined using a ratio of the number of exact matches to the number of matches having one or more one-bit errors. In some cases, only matches having a confidence ratio greater than a certain value may be used. For instance, in certain embodiments, matches may be accepted only if the confidence ratio for the match is greater than about 0.01, greater than about 0.03, greater than about 0.05, greater than about 0.1, greater than about 0.3, greater than about 0.5, greater than about 1, greater than about 3, greater than about 5, greater than about 10, greater than about 30, greater than about 50, greater than about 100, greater than about 300, greater than about 500, greater than about 1000, or any other suitable value. In addition, in some embodiments, matches may be accepted only if the confidence ratio for the identified nucleic acid target is greater than an internal standard or false positive control by about 0.01, about 0.03, about 0.05, about 0.1, about 0.3, about 0.5, about 1, about 3, about 5, about 10, about 30, about 50, about 100, about 300, about 500, about 1000, or any other suitable value.
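The confidence ratio described above can be computed directly from a list of decoded calls. The following Python sketch assumes each call is recorded as a (target, number of errors) pair; that data layout, the function names, and the example counts are hypothetical.

```python
from collections import Counter


def confidence_ratios(calls):
    """Ratio of exact matches to error-containing matches, per target.

    `calls` is an iterable of (target, n_errors) pairs (assumed layout).
    A target with no error-containing matches gets an infinite ratio.
    """
    exact = Counter(t for t, e in calls if e == 0)
    with_errors = Counter(t for t, e in calls if e >= 1)
    ratios = {}
    for target in set(exact) | set(with_errors):
        n_err = with_errors[target]
        ratios[target] = exact[target] / n_err if n_err else float("inf")
    return ratios


def accepted_targets(calls, threshold):
    """Targets whose confidence ratio exceeds `threshold`."""
    return sorted(t for t, r in confidence_ratios(calls).items() if r > threshold)


# Made-up counts for two hypothetical targets.
calls = [("A", 0)] * 9 + [("A", 1)] * 3 + [("B", 0)] * 1 + [("B", 1)] * 4
print(confidence_ratios(calls))      # A has ratio 3.0, B has ratio 0.25
print(accepted_targets(calls, 1.0))  # only A exceeds a threshold of 1
```

A comparison against an internal standard or false positive control, as mentioned above, could be made by computing the same ratio for the control and comparing the two values.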

In some embodiments, the spatial positions of the entities (and thus, nucleic acid probes that the entities may be associated with) may be determined at relatively high resolutions. For instance, the positions may be determined at spatial resolutions of better than about 100 micrometers, better than about 30 micrometers, better than about 10 micrometers, better than about 3 micrometers, better than about 1 micrometer, better than about 800 nm, better than about 600 nm, better than about 500 nm, better than about 400 nm, better than about 300 nm, better than about 200 nm, better than about 100 nm, better than about 90 nm, better than about 80 nm, better than about 70 nm, better than about 60 nm, better than about 50 nm, better than about 40 nm, better than about 30 nm, better than about 20 nm, or better than about 10 nm, etc.

There are a variety of techniques able to determine or image the spatial positions of entities optically, e.g., using fluorescence microscopy. In some cases, the spatial positions may be determined at super resolutions, or at resolutions better than the wavelength of light or the diffraction limit. Non-limiting examples include STORM (stochastic optical reconstruction microscopy), STED (stimulated emission depletion microscopy), NSOM (Near-field Scanning Optical Microscopy), 4Pi microscopy, SIM (Structured Illumination Microscopy), SMI (Spatially Modulated Illumination) microscopy, RESOLFT (Reversible Saturable Optically Linear Fluorescence Transition Microscopy), GSD (Ground State Depletion Microscopy), SSIM (Saturated Structured-Illumination Microscopy), SPDM (Spectral Precision Distance Microscopy), Photo-Activated Localization Microscopy (PALM), Fluorescence Photoactivation Localization Microscopy (FPALM), LIMON (3D Light Microscopical Nanosizing Microscopy), Super-resolution optical fluctuation imaging (SOFI), or the like. See, e.g., U.S. Pat. No. 7,838,302, issued Nov. 23, 2010, entitled “Sub-Diffraction Limit Image Resolution and Other Imaging Techniques,” by Zhuang, et al.; U.S. Pat. No. 8,564,792, issued Oct. 22, 2013, entitled “Sub-diffraction Limit Image Resolution in Three Dimensions,” by Zhuang, et al.; or Int. Pat. Apl. Pub. No. WO 2013/090360, published Jun. 20, 2013, entitled “High Resolution Dual-Objective Microscopy,” by Zhuang, et al., each incorporated herein by reference in their entireties.

As an illustrative non-limiting example, in one set of embodiments, the sample may be imaged with a high numerical aperture, oil immersion objective with 100× magnification and light collected on an electron-multiplying CCD camera. In another example, the sample could be imaged with a high numerical aperture, oil immersion lens with 40× magnification and light collected with a wide-field scientific CMOS camera. With different combinations of objectives and cameras, a single field of view may correspond to no less than 40×40 microns, 80×80 microns, 120×120 microns, 240×240 microns, 340×340 microns, or 500×500 microns, etc. in various non-limiting embodiments. Similarly, a single camera pixel may correspond, in some embodiments, to regions of the sample of no less than 80×80 nm, 120×120 nm, 160×160 nm, 240×240 nm, or 300×300 nm, etc. In another example, the sample may be imaged with a low numerical aperture, air lens with 10× magnification and light collected with a sCMOS camera. In additional embodiments, the sample may be optically sectioned by illuminating it via a single or multiple scanned diffraction limited foci generated either by scanning mirrors or a spinning disk, with the collected light passed through a single or multiple pinholes. In another embodiment, the sample may also be illuminated via a thin sheet of light generated via any one of multiple methods known to those versed in the art.
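The relationship between magnification, camera pixel size, and the sample region covered by one pixel or one field of view is simple arithmetic, sketched below in Python. The physical pixel sizes (16 micrometers and 6.5 micrometers) and sensor widths (512 and 2048 pixels) are assumed values chosen only to make the arithmetic concrete; they are not taken from the description above.

```python
def sample_pixel_nm(camera_pixel_um, magnification):
    """Edge length, in nm, of the sample region imaged onto one camera pixel."""
    return camera_pixel_um * 1000.0 / magnification


def field_of_view_um(camera_pixel_um, magnification, n_pixels):
    """Edge length, in micrometers, of the sample region seen by `n_pixels`."""
    return camera_pixel_um * n_pixels / magnification


# Assumed camera: 16 um pixels, 512 pixels wide, behind a 100x objective.
print(sample_pixel_nm(16.0, 100), field_of_view_um(16.0, 100, 512))

# Assumed camera: 6.5 um pixels, 2048 pixels wide, behind a 40x objective.
print(sample_pixel_nm(6.5, 40), field_of_view_um(6.5, 40, 2048))
```

Under these assumptions the two configurations give sample pixels of roughly 160 nm with fields of view of roughly 82 and 333 micrometers on a side, which fall within the ranges listed above.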

In one embodiment, the sample may be illuminated by single Gaussian mode laser lines. In some embodiments, the illumination profile may be flattened by passing these laser lines through a multimode fiber that is vibrated via piezo-electric or other mechanical means. In some embodiments, the illumination profile may be flattened by passing single-mode, Gaussian beams through a variety of refractive beam shapers, such as the piShaper or a series of stacked Powell lenses. In yet another set of embodiments, the Gaussian beams may be passed through a variety of different diffusing elements, such as ground glass or engineered diffusers, which may be spun in some cases at high speeds to remove residual laser speckle. In yet another embodiment, laser illumination may be passed through a series of lenslet arrays to produce overlapping images of the illumination that approximate a flat illumination field.

In some embodiments, the centroids of the spatial positions of the entities may be determined. For example, a centroid of a signaling entity may be determined within an image or series of images using image analysis algorithms known to those of ordinary skill in the art. In some cases, the algorithms may be selected to determine non-overlapping single emitters and/or partially overlapping single emitters in a sample. Non-limiting examples of suitable techniques include a maximum likelihood algorithm, a least squares algorithm, a Bayesian algorithm, a compressed sensing algorithm, or the like. Combinations of these techniques may also be used in some cases.
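As a rough illustration of centroid determination, the Python sketch below computes an intensity-weighted centroid (center of mass) of a small image patch containing a single isolated spot. This is a deliberately simpler stand-in for the maximum likelihood, least squares, Bayesian, or compressed sensing approaches named above, and it does not handle overlapping emitters; the function name and the pixel values are hypothetical.

```python
def weighted_centroid(image, background=0.0):
    """Intensity-weighted centroid (x, y), in pixel units, of a 2D patch.

    `image` is a list of rows of pixel intensities. Values at or below
    `background` contribute nothing to the estimate.
    """
    total = sum_x = sum_y = 0.0
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            weight = max(value - background, 0.0)
            total += weight
            sum_x += weight * x
            sum_y += weight * y
    if total == 0.0:
        raise ValueError("no signal above background")
    return sum_x / total, sum_y / total


# Made-up 5x5 patch with a symmetric spot centered on pixel (2, 2).
spot = [
    [0, 0, 0, 0, 0],
    [0, 1, 2, 1, 0],
    [0, 2, 8, 2, 0],
    [0, 1, 2, 1, 0],
    [0, 0, 0, 0, 0],
]
print(weighted_centroid(spot))
```

Because the estimate is a weighted average, it can fall between pixel centers, which is what allows a position to be reported more finely than the pixel grid.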

In addition, the signaling entity may be inactivated in some cases. For example, in some embodiments, a first secondary nucleic acid probe containing a signaling entity may be applied to a sample that can recognize a first read sequence, then the first secondary nucleic acid probe can be inactivated before a second secondary nucleic acid probe is applied to the sample. If multiple signaling entities are used, the same or different techniques may be used to inactivate the signaling entities, and some or all of the multiple signaling entities may be inactivated, e.g., sequentially or simultaneously.

Inactivation may be caused by removal of the signaling entity (e.g., from the sample, or from the nucleic acid probe, etc.), and/or by chemically altering the signaling entity in some fashion, e.g., by photobleaching the signaling entity, bleaching or chemically altering the structure of the signaling entity, e.g., by reduction, etc. For instance, in one set of embodiments, a fluorescent signaling entity may be inactivated by chemical or optical techniques such as oxidation, photobleaching, chemical bleaching, stringent washing or enzymatic digestion or reaction by exposure to an enzyme, dissociating the signaling entity from other components (e.g., a probe), chemical reaction of the signaling entity (e.g., to a reactant able to alter the structure of the signaling entity) or the like. For instance, bleaching may occur by exposure to oxygen or reducing agents, or the signaling entity could be chemically cleaved from the nucleic acid probe and washed away via fluid flow.

In some embodiments, various nucleic acid probes (including primary and/or secondary nucleic acid probes) may include one or more signaling entities. If more than one nucleic acid probe is used, the signaling entities may each be the same or different. In certain embodiments, a signaling entity is any entity able to emit light. For instance, in one embodiment, the signaling entity is fluorescent. In other embodiments, the signaling entity may be phosphorescent, radioactive, absorptive, etc. In some cases, the signaling entity is any entity that can be determined within a sample at relatively high resolutions, e.g., at resolutions better than the wavelength of visible light or the diffraction limit. The signaling entity may be, for example, a dye, a small molecule, a peptide or protein, or the like. The signaling entity may be a single molecule in some cases. If multiple secondary nucleic acid probes are used, the nucleic acid probes may comprise the same or different signaling entities.

Non-limiting examples of signaling entities include fluorescent entities (fluorophores) or phosphorescent entities, for example, cyanine dyes (e.g., Cy2, Cy3, Cy3B, Cy5, Cy5.5, Cy7, etc.), Alexa Fluor dyes, Atto dyes, photoswitchable dyes, photoactivatable dyes, fluorescent dyes, metal nanoparticles, semiconductor nanoparticles or “quantum dots”, fluorescent proteins such as GFP (Green Fluorescent Protein), or photoactivatable fluorescent proteins, such as PAGFP, PSCFP, PSCFP2, Dendra, Dendra2, EosFP, tdEos, mEos2, mEos3, PAmCherry, PAtagRFP, mMaple, mMaple2, and mMaple3. Other suitable signaling entities are known to those of ordinary skill in the art. See, e.g., U.S. Pat. No. 7,838,302 or U.S. Pat. Apl. Ser. No. 61/979,436, each incorporated herein by reference in its entirety.

In one set of embodiments, the signaling entity may be attached to an oligonucleotide sequence via a bond that can be cleaved to release the signaling entity. In one set of embodiments, a fluorophore may be conjugated to an oligonucleotide via a cleavable bond, such as a photocleavable bond. Non-limiting examples of photocleavable bonds include 1-(2-nitrophenyl)ethyl, 2-nitrobenzyl, biotin phosphoramidite, acrylic phosphoramidite, diethylaminocoumarin, 1-(4,5-dimethoxy-2-nitrophenyl)ethyl, cyclo-dodecyl(dimethoxy-2-nitrophenyl)ethyl, 4-aminomethyl-3-nitrobenzyl, (4-nitro-3-(1-chlorocarbonyloxyethyl)phenyl)methyl-S-acetylthioic acid ester, (4-nitro-3-(1-chlorocarbonyloxyethyl)phenyl)methyl-3-(2-pyridyldithiopropionic acid) ester, 3-(4,4′-dimethoxytrityl)-1-(2-nitrophenyl)-propane-1,3-diol-[2-cyanoethyl-(N,N-diisopropyl)]-phosphoramidite, 1-[2-nitro-5-(6-trifluoroacetylcaproamidomethyl)phenyl]-ethyl-[2-cyano-ethyl-(N,N-diisopropyl)]-phosphoramidite, 1-[2-nitro-5-(6-(4,4′-dimethoxytrityloxy)butyramidomethyl)phenyl]-ethyl-[2-cyanoethyl-(N,N-diisopropyl)]-phosphoramidite, 1-[2-nitro-5-(6-(N-(4,4′-dimethoxytrityl))-biotinamidocaproamido-methyl)phenyl]-ethyl-[2-cyanoethyl-(N,N-diisopropyl)]-phosphoramidite, or similar linkers. In another set of embodiments, the fluorophore may be conjugated to an oligonucleotide via a disulfide bond.
The disulfide bond may be cleaved by a variety of reducing agents such as, but not limited to, dithiothreitol, dithioerythritol, beta-mercaptoethanol, sodium borohydride, thioredoxin, glutaredoxin, trypsinogen, hydrazine, diisobutylaluminum hydride, oxalic acid, formic acid, ascorbic acid, phosphorous acid, tin chloride, glutathione, thioglycolate, 2,3-dimercaptopropanol, 2-mercaptoethylamine, 2-aminoethanol, tris(2-carboxyethyl)phosphine, bis(2-mercaptoethyl) sulfone, N,N′-dimethyl-N,N′-bis(mercaptoacetyl)hydrazine, 3-mercaptopropionate, dimethylformamide, thiopropyl-agarose, tri-n-butylphosphine, cysteine, iron sulfate, sodium sulfite, phosphite, hypophosphite, phosphorothioate, or the like, and/or combinations of any of these. In another embodiment, the fluorophore may be conjugated to an oligonucleotide via one or more phosphorothioate modified nucleotides in which the sulfur modification replaces the bridging and/or non-bridging oxygen. The fluorophore may be cleaved from the oligonucleotide, in certain embodiments, via addition of compounds such as but not limited to iodoethanol, iodine mixed in ethanol, silver nitrate, or mercury chloride. In yet another set of embodiments, the signaling entity may be chemically inactivated through reduction or oxidation. For example, in one embodiment, a chromophore such as Cy5 or Cy7 may be reduced using sodium borohydride to a stable, non-fluorescent state. In still another set of embodiments, a fluorophore may be conjugated to an oligonucleotide via an azo bond, and the azo bond may be cleaved with 2-[(2-N-arylamino)phenylazo]pyridine. In yet another set of embodiments, a fluorophore may be conjugated to an oligonucleotide via a suitable nucleic acid segment that can be cleaved upon suitable exposure to DNase, e.g., an exodeoxyribonuclease or an endodeoxyribonuclease. Examples include, but are not limited to, deoxyribonuclease I or deoxyribonuclease II. In one set of embodiments, the cleavage may occur via a restriction endonuclease.
Non-limiting examples of potentially suitable restriction endonucleases include BamHI, BsrI, NotI, XmaI, PspAI, DpnI, MboI, MnlI, Eco57I, Ksp632I, DraIII, AhaII, SmaI, MluI, HpaI, ApaI, BclI, BstEII, TaqI, EcoRI, SacI, HindIII, HaeII, DraI, Tsp509I, Sau3AI, PacI, etc. Over 3000 restriction enzymes have been studied in detail, and more than 600 of these are available commercially. In yet another set of embodiments, a fluorophore may be conjugated to biotin, and the oligonucleotide conjugated to avidin or streptavidin. An interaction between biotin and avidin or streptavidin allows the fluorophore to be conjugated to the oligonucleotide, while sufficient exposure to an excess of additional, free biotin could “outcompete” the linkage and thereby cause cleavage to occur. In addition, in another set of embodiments, the probes may be removed using corresponding “toe-hold-probes,” which comprise the same sequence as the probe, as well as an extra number of bases of homology to the encoding probes (e.g., 1-20 extra bases, for example, 5 extra bases). These probes may remove the labeled readout probe through a strand-displacement interaction.

As used herein, the term “light” generally refers to electromagnetic radiation, having any suitable wavelength (or equivalently, frequency). For instance, in some embodiments, the light may include wavelengths in the optical or visual range (for example, having a wavelength of between about 400 nm and about 700 nm, i.e., “visible light”), infrared wavelengths (for example, having a wavelength of between about 300 micrometers and 700 nm), ultraviolet wavelengths (for example, having a wavelength of between about 400 nm and about 10 nm), or the like. In certain cases, as discussed in detail below, more than one entity may be used, i.e., entities that are chemically different or distinct, for example, structurally. However, in other cases, the entities may be chemically identical or at least substantially chemically identical.

In one set of embodiments, the signaling entity is “switchable,” i.e., the entity can be switched between two or more states, at least one of which emits light having a desired wavelength. In the other state(s), the entity may emit no light, or emit light at a different wavelength. For instance, an entity may be “activated” to a first state able to produce light having a desired wavelength, and “deactivated” to a second state not able to emit light of the same wavelength. An entity is “photoactivatable” if it can be activated by incident light of a suitable wavelength. As a non-limiting example, Cy5 can be switched between a fluorescent and a dark state in a controlled and reversible manner by light of different wavelengths, i.e., 633 nm (or 642 nm, 647 nm, 656 nm) red light can switch or deactivate Cy5 to a stable dark state, while 405 nm violet light can switch or activate the Cy5 back to the fluorescent state. In some cases, the entity can be reversibly switched between the two or more states, e.g., upon exposure to the proper stimuli. For example, a first stimulus (e.g., a first wavelength of light) may be used to activate the switchable entity, while a second stimulus (e.g., a second wavelength of light) may be used to deactivate the switchable entity, for instance, to a non-emitting state. Any suitable method may be used to activate the entity. For example, in one embodiment, incident light of a suitable wavelength may be used to activate the entity to emit light, i.e., the entity is “photoswitchable.” Thus, the photoswitchable entity can be switched between different light-emitting or non-emitting states by incident light, e.g., of different wavelengths. The light may be monochromatic (e.g., produced using a laser) or polychromatic. In another embodiment, the entity may be activated upon stimulation by electric field and/or magnetic field.
In other embodiments, the entity may be activated upon exposure to a suitable chemical environment, e.g., by adjusting the pH, or inducing a reversible chemical reaction involving the entity, etc. Similarly, any suitable method may be used to deactivate the entity, and the methods of activating and deactivating the entity need not be the same. For instance, the entity may be deactivated upon exposure to incident light of a suitable wavelength, or the entity may be deactivated by waiting a sufficient time.

Typically, a “switchable” entity can be identified by one of ordinary skill in the art by determining conditions under which an entity in a first state can emit light when exposed to an excitation wavelength, switching the entity from the first state to the second state, e.g., upon exposure to light of a switching wavelength, then showing that the entity, while in the second state, can no longer emit light (or emits light at a much reduced intensity) when exposed to the excitation wavelength.

In one set of embodiments, as discussed, a switchable entity may be switched upon exposure to light. In some cases, the light used to activate the switchable entity may come from an external source, e.g., a light source such as a laser light source, another light-emitting entity proximate the switchable entity, etc. The second, light-emitting entity, in some cases, may be a fluorescent entity, and in certain embodiments, the second, light-emitting entity may itself also be a switchable entity.

In some embodiments, the switchable entity includes a first, light-emitting portion (e.g., a fluorophore), and a second portion that activates or “switches” the first portion. For example, upon exposure to light, the second portion of the switchable entity may activate the first portion, causing the first portion to emit light. Examples of activator portions include, but are not limited to, Alexa Fluor 405 (Invitrogen), Alexa Fluor 488 (Invitrogen), Cy2 (GE Healthcare), Cy3 (GE Healthcare), Cy3B (GE Healthcare), Cy3.5 (GE Healthcare), or other suitable dyes. Examples of light-emitting portions include, but are not limited to, Cy5, Cy5.5 (GE Healthcare), Cy7 (GE Healthcare), Alexa Fluor 647 (Invitrogen), Alexa Fluor 680 (Invitrogen), Alexa Fluor 700 (Invitrogen), Alexa Fluor 750 (Invitrogen), Alexa Fluor 790 (Invitrogen), DiD, DiR, YOYO-3 (Invitrogen), YO-PRO-3 (Invitrogen), TOT-3 (Invitrogen), TO-PRO-3 (Invitrogen) or other suitable dyes. These may be linked together, e.g., covalently, for example, directly, or through a linker, e.g., forming compounds such as, but not limited to, Cy5-Alexa Fluor 405, Cy5-Alexa Fluor 488, Cy5-Cy2, Cy5-Cy3, Cy5-Cy3.5, Cy5.5-Alexa Fluor 405, Cy5.5-Alexa Fluor 488, Cy5.5-Cy2, Cy5.5-Cy3, Cy5.5-Cy3.5, Cy7-Alexa Fluor 405, Cy7-Alexa Fluor 488, Cy7-Cy2, Cy7-Cy3, Cy7-Cy3.5, Alexa Fluor 647-Alexa Fluor 405, Alexa Fluor 647-Alexa Fluor 488, Alexa Fluor 647-Cy2, Alexa Fluor 647-Cy3, Alexa Fluor 647-Cy3.5, Alexa Fluor 750-Alexa Fluor 405, Alexa Fluor 750-Alexa Fluor 488, Alexa Fluor 750-Cy2, Alexa Fluor 750-Cy3, or Alexa Fluor 750-Cy3.5. Those of ordinary skill in the art will be aware of the structures of these and other compounds, many of which are available commercially. The portions may be linked via a covalent bond, or by a linker, such as those described in detail below.
Other light-emitting or activator portions may include portions having two quaternized nitrogen atoms joined by a polymethine chain, where each nitrogen is independently part of a heteroaromatic moiety, such as pyrrole, imidazole, thiazole, pyridine, quinoline, indole, benzothiazole, etc., or part of a nonaromatic amine. In some cases, there may be 5, 6, 7, 8, 9, or more carbon atoms between the two nitrogen atoms.

In certain cases, the light-emitting portion and the activator portions, when isolated from each other, may each be fluorophores, i.e., entities that can emit light of a certain emission wavelength when exposed to a stimulus, for example, an excitation wavelength. However, when a switchable entity is formed that comprises the first fluorophore and the second fluorophore, the first fluorophore forms a first, light-emitting portion and the second fluorophore forms an activator portion that activates or “switches” the first portion in response to a stimulus. For example, the switchable entity may comprise a first fluorophore directly bonded to the second fluorophore, or the first and second entity may be connected via a linker or a common entity. Whether a pair of light-emitting portion and activator portion produces a suitable switchable entity can be tested by methods known to those of ordinary skill in the art. For example, light of various wavelengths can be used to stimulate the pair and emission light from the light-emitting portion can be measured to determine whether the pair makes a suitable switch.

As a non-limiting example, Cy3 and Cy5 may be linked together to form such an entity. In this example, Cy3 is an activator portion that is able to activate Cy5, the light-emitting portion. Thus, light at or near the absorption maximum (e.g., near 532 nm light for Cy3) of the activation or second portion of the entity may cause that portion to activate the first, light-emitting portion, thereby causing the first portion to emit light (e.g., near 647 nm for Cy5). See, e.g., U.S. Pat. No. 7,838,302, incorporated herein by reference in its entirety. In some cases, the first, light-emitting portion can subsequently be deactivated by any suitable technique (e.g., by directing 647 nm red light to the Cy5 portion of the molecule).

Other non-limiting examples of potentially suitable activator portionsinclude 1,5 IAEDANS, 1,8-ANS, 4-Methylumbelliferone,5-carboxy-2,7-dichlorofluorescein, 5-Carboxyfluorescein (5-FAM),5-Carboxynapthofluorescein, 5-Carboxytetramethylrhodamine (5-TAMRA),5-FAM (5-Carboxyfluorescein), 5-HAT (Hydroxy Tryptamine), 5-HydroxyTryptamine (HAT), 5-ROX (carboxy-X-rhodamine), 5-TAMRA(5-Carboxytetramethylrhodamine), 6-Carboxyrhodamine 6G, 6-CR 6G, 6-JOE,7-Amino-4-methylcoumarin, 7-Aminoactinomycin D (7-AAD),7-Hydroxy-4-methylcoumarin, 9-Amino-6-chloro-2-methoxyacridine, ABQ,Acid Fuchsin, ACMA (9-Amino-6-chloro-2-methoxyacridine), AcridineOrange, Acridine Red, Acridine Yellow, Acriflavin, Acriflavin FeulgenSITSA, Alexa Fluor 350, Alexa Fluor 405, Alexa Fluor 430, Alexa Fluor488, Alexa Fluor 500, Alexa Fluor 514, Alexa Fluor 532, Alexa Fluor 546,Alexa Fluor 555, Alexa Fluor 568, Alexa Fluor 594, Alexa Fluor 610,Alexa Fluor 633, Alexa Fluor 635, Alizarin Complexon, Alizarin Red, AMC,AMCA-S, AMCA (Aminomethylcoumarin), AMCA-X, Aminoactinomycin D,Aminocoumarin, Aminomethylcoumarin (AMCA), Anilin Blue, Anthrocylstearate, APTRA-BTC, APTS, Astrazon Brilliant Red 4G, Astrazon Orange R,Astrazon Red 6B, Astrazon Yellow 7 GLL, Atabrine, ATTO 390, ATTO 425,ATTO 465, ATTO 488, ATTO 495, ATTO 520, ATTO 532, ATTO 550, ATTO 565,ATTO 590, ATTO 594, ATTO 610, ATTO 611X, ATTO 620, ATTO 633, ATTO 635,ATTO 647, ATTO 647N, ATTO 655, ATTO 680, ATTO 700, ATTO 725, ATTO 740,ATTO-TAG CBQCA, ATTO-TAG FQ, Auramine, Aurophosphine G, Aurophosphine,BAO 9 (Bisaminophenyloxadiazole), BCECF (high pH), BCECF (low pH),Berberine Sulphate, Bimane, Bisbenzamide, Bisbenzimide (Hoechst),bis-BTC, Blancophor FFG, Blancophor SV, BOBO-1, BOBO-3, Bodipy 492/515,Bodipy 493/503, Bodipy 500/510, Bodipy 505/515, Bodipy 530/550, Bodipy542/563, Bodipy 558/568, Bodipy 564/570, Bodipy 576/589, Bodipy 581/591,Bodipy 630/650-X, Bodipy 650/665-X, Bodipy 665/676, Bodipy Fl, Bodipy FLATP, Bodipy Fl-Ceramide, Bodipy R6G, Bodipy TMR, 
Bodipy TMR-X conjugate,Bodipy TMR-X, SE, Bodipy TR, Bodipy TR ATP, Bodipy TR-X SE, BO-PRO-1,BO-PRO-3, Brilliant Sulphoflavin FF, BTC, BTC-5N, Calcein, Calcein Blue,Calcium Crimson, Calcium Green, Calcium Green-1 Ca²⁺ Dye, CalciumGreen-2 Ca²⁺, Calcium Green-5N Ca²⁺, Calcium Green-C18 Ca²⁺, CalciumOrange, Calcofluor White, Carboxy-X-rhodamine (5-ROX), Cascade Blue,Cascade Yellow, Catecholamine, CCF2 (GeneBlazer), CFDA, Chromomycin A,Chromomycin A, CL-NERF, CMFDA, Coumarin Phalloidin, CPM Methylcoumarin,CTC, CTC Formazan, Cy2, Cy3.1 8, Cy3.5, Cy3, Cy5.1 8, cyclic AMPFluorosensor (FiCRhR), Dabcyl, Dansyl, Dansyl Amine, Dansyl Cadaverine,Dansyl Chloride, Dansyl DHPE, Dansyl fluoride, DAPI, Dapoxyl, Dapoxyl 2,Dapoxyl 3′ DCFDA, DCFH (Dichlorodihydrofluorescein Diacetate), DDAO, DHR(Dihydorhodamine 123), Di-4-ANEPPS, Di-8-ANEPPS (non-ratio), DiA(4-Di-16-ASP), Dichlorodihydrofluorescein Diacetate (DCFH),DiD—Lipophilic Tracer, DiD (DiIC18(5)), DIDS, Dihydorhodamine 123 (DHR),DiI (DiIC18(3)), Dinitrophenol, DiO (DiOC18(3)), DiR, DiR (DiIC18(7)),DM-NERF (high pH), DNP, Dopamine, DTAF, DY-630-NHS, DY-635-NHS, DyLight405, DyLight 488, DyLight 549, DyLight 633, DyLight 649, DyLight 680,DyLight 800, ELF 97, Eosin, Erythrosin, Erythrosin ITC, EthidiumBromide, Ethidium homodimer-1 (EthD-1), Euchrysin, EukoLight, Europium(III) chloride, Fast Blue, FDA, Feulgen (Pararosaniline), FIF(Formaldehyd Induced Fluorescence), FITC, Flazo Orange, Fluo-3, Fluo-4,Fluorescein (FITC), Fluorescein Diacetate, Fluoro-Emerald, Fluoro-Gold(Hydroxystilbamidine), Fluor-Ruby, FluorX, FM 1-43, FM 4-46, Fura Red(high pH), Fura Red/Fluo-3, Fura-2, Fura-2/BCECF, Genacryl Brilliant RedB, Genacryl Brilliant Yellow 10GF, Genacryl Pink 3G, Genacryl Yellow5GF, GeneBlazer (CCF2), Gloxalic Acid, Granular blue, Haematoporphyrin,Hoechst 33258, Hoechst 33342, Hoechst 34580, HPTS, Hydroxycoumarin,Hydroxystilbamidine (FluoroGold), Hydroxytryptamine, Indo-1, highcalcium, Indo-1, low calcium, Indodicarbocyanine 
(DiD),Indotricarbocyanine (DiR), Intrawhite Cf, JC-1, JO-JO-1, JO-PRO-1,LaserPro, Laurodan, LDS 751 (DNA), LDS 751 (RNA), Leucophor PAF,Leucophor SF, Leucophor WS, Lissamine Rhodamine, Lissamine Rhodamine B,Calcein/Ethidium homodimer, LOLO-1, LO-PRO-1, Lucifer Yellow, LysoTracker Blue, Lyso Tracker Blue-White, Lyso Tracker Green, Lyso TrackerRed, Lyso Tracker Yellow, LysoSensor Blue, LysoSensor Green, LysoSensorYellow/Blue, Mag Green, Magdala Red (Phloxin B), Mag-Fura Red,Mag-Fura-2, Mag-Fura-5, Mag-Indo-1, Magnesium Green, Magnesium Orange,Malachite Green, Marina Blue, Maxilon Brilliant Flavin 10 GFF, MaxilonBrilliant Flavin 8 GFF, Merocyanin, Methoxycoumarin, Mitotracker GreenFM, Mitotracker Orange, Mitotracker Red, Mitramycin, Monobromobimane,Monobromobimane (mBBr-GSH), Monochlorobimane, MPS (Methyl Green PyronineStilbene), NBD, NBD Amine, Nile Red, Nitrobenzoxadidole, Noradrenaline,Nuclear Fast Red, Nuclear Yellow, Nylosan Brilliant Iavin E8G, OregonGreen, Oregon Green 488-X, Oregon Green, Oregon Green 488, Oregon Green500, Oregon Green 514, Pacific Blue, Pararosaniline (Feulgen), PBFI,Phloxin B (Magdala Red), Phorwite AR, Phorwite BKL, Phorwite Rev,Phorwite RPA, Phosphine 3R, PKH26 (Sigma), PKH67, PMIA, Pontochrome BlueBlack, POPO-1, POPO-3, PO-PRO-1, PO-PRO-3, Primuline, Procion Yellow,Propidium Iodid (PI), PyMPO, Pyrene, Pyronine, Pyronine B, PyrozalBrilliant Flavin 7GF, QSY 7, Quinacrine Mustard, Resorufin, RH 414,Rhod-2, Rhodamine, Rhodamine 110, Rhodamine 123, Rhodamine 5 GLD,Rhodamine 6G, Rhodamine B, Rhodamine B 200, Rhodamine B extra, RhodamineBB, Rhodamine BG, Rhodamine Green, Rhodamine Phallicidine, RhodaminePhalloidine, Rhodamine Red, Rhodamine WT, Rose Bengal, S65A, S65C, S65L,S65T, SBFI, Serotonin, Sevron Brilliant Red 2B, Sevron Brilliant Red 4G,Sevron Brilliant Red B, Sevron Orange, Sevron Yellow L, SITS, SITS(Primuline), SITS (Stilbene Isothiosulphonic Acid), SNAFL calcein,SNAFL-1, SNAFL-2, SNARF calcein, SNARFI, Sodium Green, 
SpectrumAqua,SpectrumGreen, SpectrumOrange, Spectrum Red, SPQ(6-methoxy-N-(3-sulfopropyl)quinolinium), Stilbene, Sulphorhodamine Bcan C, Sulphorhodamine Extra, SYTO 11, SYTO 12, SYTO 13, SYTO 14, SYTO15, SYTO 16, SYTO 17, SYTO 18, SYTO 20, SYTO 21, SYTO 22, SYTO 23, SYTO24, SYTO 25, SYTO 40, SYTO 41, SYTO 42, SYTO 43, SYTO 44, SYTO 45, SYTO59, SYTO 60, SYTO 61, SYTO 62, SYTO 63, SYTO 64, SYTO 80, SYTO 81, SYTO82, SYTO 83, SYTO 84, SYTO 85, SYTOX Blue, SYTOX Green, SYTOX Orange,Tetracycline, Tetramethylrhodamine (TAMRA), Texas Red, Texas Red-Xconjugate, Thiadicarbocyanine (DiSC3), Thiazine Red R, Thiazole Orange,Thioflavin 5, Thioflavin S, Thioflavin TCN, Thiolyte, Thiozole Orange,Tinopol CBS (Calcofluor White), TMR, TO-PRO-1, TO-PRO-3, TO-PRO-5,TOTO-1, TOTO-3, TRITC (tetramethylrodamine isothiocyanate), True Blue,TruRed, Ultralite, Uranine B, Uvitex SFC, WW 781, X-Rhodamine, XRITC,Xylene Orange, Y66F, Y66H, Y66W, YO-PRO-1, YO-PRO-3, YOYO-1, YOYO-3,SYBR Green, Thiazole orange (interchelating dyes), or combinationsthereof.

Another aspect of the invention is directed to a computer-implemented method. For instance, a computer and/or an automated system may be provided that is able to automatically and/or repetitively perform any of the methods described herein. As used herein, “automated” devices refer to devices that are able to operate without human direction, i.e., an automated device can perform a function during a period of time after any human has finished taking any action to promote the function, e.g. by entering instructions into a computer to start the process. Typically, automated equipment can perform repetitive functions after this point in time. The processing steps may also be recorded onto a machine-readable medium in some cases.

For example, in some cases, a computer may be used to control imaging of the sample, e.g., using fluorescence microscopy, STORM or other super-resolution techniques such as those described herein. In some cases, the computer may also control operations such as drift correction, physical registration, hybridization and cluster alignment in image analysis, cluster decoding (e.g., fluorescent cluster decoding), error detection or correction (e.g., as discussed herein), noise reduction, identification of foreground features from background features (such as noise or debris in images), or the like. As an example, the computer may be used to control activation and/or excitation of signaling entities within the sample, and/or the acquisition of images of the signaling entities. In one set of embodiments, a sample may be excited using light having various wavelengths and/or intensities, and the sequence of the wavelengths of light used to excite the sample may be correlated, using a computer, to the images acquired of the sample containing the signaling entities. For instance, the computer may apply light having various wavelengths and/or intensities to a sample to yield different average numbers of signaling entities in each region of interest (e.g., one activated entity per location, two activated entities per location, etc.). In some cases, this information may be used to construct an image and/or determine the locations of the signaling entities, in some cases at high resolutions, as noted above.

In some aspects, the sample is positioned on a microscope. In some cases, the microscope may contain one or more channels, such as microfluidic channels, to direct or control fluid to or from the sample. For instance, in one embodiment, nucleic acid probes such as those discussed herein may be introduced and/or removed from the sample by flowing fluid through one or more channels to or from the sample. In some cases, there may also be one or more chambers or reservoirs for holding fluid, e.g., in fluidic communication with the channel, and/or with the sample. Those of ordinary skill in the art will be familiar with channels, including microfluidic channels, for moving fluid to or from a sample.

As used herein, “microfluidic,” “microscopic,” “microscale,” the “micro-” prefix (for example, as in “microchannel”), and the like generally refer to elements or articles having widths or diameters of less than about 1 mm, and less than about 100 microns (micrometers) in some cases. In some embodiments, larger channels may be used instead of, or in conjunction with, microfluidic channels for any of the embodiments discussed herein. For example, channels having widths or diameters of less than about 10 mm, less than about 9 mm, less than about 8 mm, less than about 7 mm, less than about 6 mm, less than about 5 mm, less than about 4 mm, less than about 3 mm, or less than about 2 mm may be used in certain instances. In some cases, the element or article includes a channel through which a fluid can flow. In all embodiments, specified widths can be a smallest width (i.e. a width as specified where, at that location, the article can have a larger width in a different dimension), or a largest width (i.e. where, at that location, the article has a width that is no wider than as specified, but can have a length that is greater). Thus, for instance, the microfluidic channel may have an average cross-sectional dimension (e.g., perpendicular to the direction of flow of fluid in the microfluidic channel) of less than about 1 mm, less than about 500 microns, less than about 300 microns, or less than about 100 microns. In some cases, the microfluidic channel may have an average diameter of less than about 60 microns, less than about 50 microns, less than about 40 microns, less than about 30 microns, less than about 25 microns, less than about 10 microns, less than about 5 microns, less than about 3 microns, or less than about 1 micron.

A “channel,” as used herein, means a feature on or in an article (e.g., a substrate) that at least partially directs the flow of a fluid. In some cases, the channel may be formed, at least in part, by a single component, e.g. an etched substrate or molded unit. The channel can have any cross-sectional shape, for example, circular, oval, triangular, irregular, square or rectangular (having any aspect ratio), or the like, and can be covered or uncovered (i.e., open to the external environment surrounding the channel). In embodiments where the channel is completely covered, at least one portion of the channel can have a cross-section that is completely enclosed, and/or the entire channel may be completely enclosed along its entire length with the exception of its inlet and outlet.

A channel may have any aspect ratio, e.g., an aspect ratio (length to average cross-sectional dimension) of at least about 2:1, more typically at least about 3:1, at least about 5:1, at least about 10:1, etc. As used herein, a “cross-sectional dimension,” in reference to a fluidic or microfluidic channel, is measured in a direction generally perpendicular to fluid flow within the channel. A channel generally will include characteristics that facilitate control over fluid transport, e.g., structural characteristics and/or physical or chemical characteristics (hydrophobicity vs. hydrophilicity) and/or other characteristics that can exert a force (e.g., a containing force) on a fluid. The fluid within the channel may partially or completely fill the channel. In some cases the fluid may be held or confined within the channel or a portion of the channel in some fashion, for example, using surface tension (e.g., such that the fluid is held within the channel within a meniscus, such as a concave or convex meniscus). In an article or substrate, some (or all) of the channels may be of a particular size or less, for example, having a largest dimension perpendicular to fluid flow of less than about 5 mm, less than about 2 mm, less than about 1 mm, less than about 500 microns, less than about 200 microns, less than about 100 microns, less than about 60 microns, less than about 50 microns, less than about 40 microns, less than about 30 microns, less than about 25 microns, less than about 10 microns, less than about 3 microns, less than about 1 micron, less than about 300 nm, less than about 100 nm, less than about 30 nm, or less than about 10 nm in some cases. In one embodiment, the channel is a capillary.

A variety of materials and methods, according to certain aspects of the invention, can be used to form devices or components containing microfluidic channels, chambers, etc. For example, various devices or components can be formed from solid materials, in which the channels can be formed via micromachining, film deposition processes such as spin coating and chemical vapor deposition, physical vapor deposition, laser fabrication, photolithographic techniques, etching methods including wet chemical or plasma processes, electrodeposition, and the like. See, for example, Scientific American, 248:44-55, 1983 (Angell, et al).

In one set of embodiments, various structures or components can be formed of a polymer, for example, an elastomeric polymer such as polydimethylsiloxane (“PDMS”), polytetrafluoroethylene (“PTFE” or Teflon®), or the like. For instance, according to one embodiment, a channel such as a microfluidic channel may be implemented by fabricating the fluidic system separately using PDMS or other soft lithography techniques (details of soft lithography techniques suitable for this embodiment are discussed in the references entitled “Soft Lithography,” by Younan Xia and George M. Whitesides, published in the Annual Review of Material Science, 1998, Vol. 28, pages 153-184, and “Soft Lithography in Biology and Biochemistry,” by George M. Whitesides, Emanuele Ostuni, Shuichi Takayama, Xingyu Jiang and Donald E. Ingber, published in the Annual Review of Biomedical Engineering, 2001, Vol. 3, pages 335-373; each of these references is incorporated herein by reference).

Other examples of potentially suitable polymers include, but are not limited to, polyethylene terephthalate (PET), polyacrylate, polymethacrylate, polycarbonate, polystyrene, polyethylene, polypropylene, polyvinylchloride, cyclic olefin copolymer (COC), polytetrafluoroethylene, a fluorinated polymer, a silicone such as polydimethylsiloxane, polyvinylidene chloride, bis-benzocyclobutene (“BCB”), a polyimide, a fluorinated derivative of a polyimide, or the like. Combinations, copolymers, or blends involving polymers including those described above are also envisioned. The device may also be formed from composite materials, for example, a composite of a polymer and a semiconductor material.

In some embodiments, various microfluidic structures or components of the device are fabricated from polymeric and/or flexible and/or elastomeric materials, and can be conveniently formed of a hardenable fluid, facilitating fabrication via molding (e.g. replica molding, injection molding, cast molding, etc.). The hardenable fluid can be essentially any fluid that can be induced to solidify, or that spontaneously solidifies, into a solid capable of containing and/or transporting fluids contemplated for use in and with the fluidic network. In one embodiment, the hardenable fluid comprises a polymeric liquid or a liquid polymeric precursor (i.e. a “prepolymer”). Suitable polymeric liquids can include, for example, thermoplastic polymers, thermoset polymers, waxes, metals, or mixtures or composites thereof heated above their melting point. As another example, a suitable polymeric liquid may include a solution of one or more polymers in a suitable solvent, which solution forms a solid polymeric material upon removal of the solvent, for example, by evaporation. Such polymeric materials, which can be solidified from, for example, a melt state or by solvent evaporation, are well known to those of ordinary skill in the art. A variety of polymeric materials, many of which are elastomeric, are suitable, and are also suitable for forming molds or mold masters, for embodiments where one or both of the mold masters is composed of an elastomeric material. A non-limiting list of examples of such polymers includes polymers of the general classes of silicone polymers, epoxy polymers, and acrylate polymers. Epoxy polymers are characterized by the presence of a three-membered cyclic ether group commonly referred to as an epoxy group, 1,2-epoxide, or oxirane. For example, diglycidyl ethers of bisphenol A can be used, in addition to compounds based on aromatic amine, triazine, and cycloaliphatic backbones. Another example includes the well-known Novolac polymers. Non-limiting examples of silicone elastomers suitable for use according to the invention include those formed from precursors including the chlorosilanes such as methylchlorosilanes, ethylchlorosilanes, phenylchlorosilanes, etc.

Silicone polymers are used in certain embodiments, for example, the silicone elastomer polydimethylsiloxane. Non-limiting examples of PDMS polymers include those sold under the trademark Sylgard by Dow Chemical Co., Midland, Mich., and particularly Sylgard 182, Sylgard 184, and Sylgard 186. Silicone polymers including PDMS have several beneficial properties simplifying fabrication of various structures of the invention. For instance, such materials are inexpensive, readily available, and can be solidified from a prepolymeric liquid via curing with heat. For example, PDMSs are typically curable by exposure of the prepolymeric liquid to temperatures of, for example, about 65° C. to about 75° C. for exposure times of, for example, at least about an hour. Also, silicone polymers, such as PDMS, can be elastomeric and thus may be useful for forming very small features with relatively high aspect ratios, necessary in certain embodiments of the invention. Flexible (e.g., elastomeric) molds or masters can be advantageous in this regard.

One advantage of forming structures such as microfluidic structures or channels from silicone polymers, such as PDMS, is the ability of such polymers to be oxidized, for example by exposure to an oxygen-containing plasma such as an air plasma, so that the oxidized structures contain, at their surface, chemical groups capable of cross-linking to other oxidized silicone polymer surfaces or to the oxidized surfaces of a variety of other polymeric and non-polymeric materials. Thus, structures can be fabricated and then oxidized and essentially irreversibly sealed to other silicone polymer surfaces, or to the surfaces of other substrates reactive with the oxidized silicone polymer surfaces, without the need for separate adhesives or other sealing means. In most cases, sealing can be completed simply by contacting an oxidized silicone surface to another surface without the need to apply auxiliary pressure to form the seal. That is, the pre-oxidized silicone surface acts as a contact adhesive against suitable mating surfaces. Specifically, in addition to being irreversibly sealable to itself, oxidized silicone such as oxidized PDMS can also be sealed irreversibly to a range of oxidized materials other than itself including, for example, glass, silicon, silicon oxide, quartz, silicon nitride, polyethylene, polystyrene, glassy carbon, and epoxy polymers, which have been oxidized in a similar fashion to the PDMS surface (for example, via exposure to an oxygen-containing plasma). Oxidation and sealing methods useful in the context of the present invention, as well as overall molding techniques, are described in the art, for example, in an article entitled “Rapid Prototyping of Microfluidic Systems and Polydimethylsiloxane,” Anal. Chem., 70:474-480, 1998 (Duffy et al.), incorporated herein by reference.

The following documents are each incorporated herein by reference in their entireties: U.S. Pat. No. 7,838,302, issued Nov. 23, 2010, entitled “Sub-Diffraction Limit Image Resolution and Other Imaging Techniques,” by Zhuang, et al.; U.S. Pat. No. 8,564,792, issued Oct. 22, 2013, entitled “Sub-diffraction Limit Image Resolution in Three Dimensions,” by Zhuang, et al.; and Int. Pat. Apl. Pub. No. WO 2013/090360, published Jun. 20, 2013, entitled “High Resolution Dual-Objective Microscopy,” by Zhuang, et al.

In addition, incorporated herein by reference in their entireties are U.S. Provisional Patent Application Ser. No. 62/031,062, filed Jul. 30, 2014, entitled “Systems and Methods for Determining Nucleic Acids,” by Zhuang, et al.; U.S. Provisional Patent Application Ser. No. 62/050,636, filed Sep. 15, 2014, entitled “Probe Library Construction,” by Zhuang, et al.; U.S. Provisional Patent Application Ser. No. 62/142,653, filed Apr. 3, 2015, entitled “Systems and Methods for Determining Nucleic Acids,” by Zhuang, et al.; and a PCT application filed on even date herewith, entitled “Probe Library Construction,” by Zhuang, et al.

The following examples are intended to illustrate certain embodiments of the present invention, but do not exemplify the full scope of the invention.

Example 1

This example presents a platform to enable the simultaneous detection of the number and spatial organization of thousands of distinct mRNAs within single cells with high efficiency and low error-rate using a novel form of highly multiplexed fluorescence in situ hybridization (FISH). This example accomplishes these measurements by integrating and innovating methods for massively parallel probe synthesis, super-resolution imaging, and self-correcting error-checking codes.

Here, these examples present methods for the simultaneous detection of some or all of the thousands of unique RNAs expressed in a cell. This approach not only promises to revolutionize the throughput of the already effective single-molecule FISH (smFISH) approach, but also allows researchers to benefit from the hypothesis-free discovery approach which has made other whole-genome systems approaches to biology so effective. For example, this whole-genome approach may allow researchers to discover RNAs whose expression levels and/or subcellular localization patterns change under certain conditions of interest, such as disease states, without knowing, a priori, which mRNA will change in abundance or localization. Simultaneous measurements of hundreds of genes within a single cell also allow for the identification of correlations between genes in expression and localization patterns in some cases.

This can be achieved using methods for highly multiplexed smFISH via the sequential hybridization of orthogonal detection probes and super-resolution imaging, reducing the cost of probe synthesis, and the development of a highly automated system to minimize demands on the user, as discussed herein. This provides an integrated platform to handle the bioinformatics of probe design, the mathematics of error-correcting codes, the complexity of image registration and analysis, and the cumbersome fluid handling through a simple suite of user-friendly interfaces. This integration allows easy operation with limited user training and facilitates the rapid collection of data.

This example illustrates: (1) computational design of “codewords” to attach to all RNA targets in the cell that will allow unique identification of each RNA with some degree of experimental error tolerance, (2) translation of these codewords into nucleotide sequences and synthesis of the required single-stranded (ss) oligonucleotide (e.g. ssDNA) probes, (3) sample fixation and hybridization of these probes to the RNA targets in situ, (4) read-out of these codewords via successive rounds of hybridization of distinct fluorescent probes imaged with conventional fluorescence microscopy or super-resolution fluorescence microscopy, and (5) automated decoding of measured codewords combined with computational error correction to uniquely and robustly identify individual mRNAs.

In the first step, a “codeword” is assigned to every RNA that is to be labeled. In a typical design these may be strings of N binary letters or positions. Codewords may be chosen from the same wide range of existing error-tolerant or error-correcting encoding schemes developed for digital storage and communication, e.g., using Hamming codes or the like. For example, actin RNA may be assigned the binary codeword 11001010. Each codeword may be unique and separated from the other codewords by a Hamming distance, h, which measures the number of letters or positions that must be incorrectly read for one codeword to be misinterpreted as a different one. A Hamming distance greater than 1 between all codewords allows for some measurement errors to be detected, since simple errors would produce codewords that are not used to encode RNAs. For a Hamming distance larger than 2, it is also possible to correct some errors, as codewords with one error will be closest in Hamming distance to a single, unique codeword. The total number of different RNAs to be detected from the transcriptome and the amount of error correction desired determine the length of the codewords. Information theory provides several efficient algorithms for assembling error-correcting binary codebooks.
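As a minimal sketch of the Hamming-distance reasoning above, the following Python fragment computes the minimum pairwise distance h of a small codebook and the number of read errors that can be detected or corrected. Only the actin codeword 11001010 comes from the text; the other two codewords and all function names are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length codewords differ."""
    if len(a) != len(b):
        raise ValueError("codewords must have the same length")
    return sum(x != y for x, y in zip(a, b))


def min_distance(codebook: list[str]) -> int:
    """Smallest Hamming distance between any two codewords in the codebook."""
    return min(hamming_distance(a, b) for a, b in combinations(codebook, 2))


# Hypothetical 8-position codebook; only the first word appears in the text.
codebook = ["11001010", "00110110", "10100101"]

h = min_distance(codebook)
detectable = h - 1          # errors detectable when no correction is attempted
correctable = (h - 1) // 2  # errors that map back to a unique nearest codeword

print(h, detectable, correctable)  # prints: 4 3 1
```

With h = 4, a single misread position can be corrected and a double misread can still be flagged as invalid, which is the behavior the text describes for Hamming distances larger than 2.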

In the second step, this encoding scheme is translated into a set of oligonucleotide (e.g. DNA) probe sequences, which can be called primary probes or encoding probes, each of which not only targets a probe to the RNA of interest but also encodes the unique binary codeword within a set of secondary binding sites (FIGS. 1A-1C). For example, primary binding sequences may first be designed for each targeted mRNA. These sequences are “target sequences” that comprise nucleotide sequences complementary to their target RNAs, computationally selected to satisfy a stringent set of hybridization conditions, including uniqueness in the target genome. To improve the efficiency of hybridization to individual mRNAs, multiple primary target sequences are designed for each individual RNA. Then each position within the set of codewords is assigned a unique oligonucleotide (e.g. DNA) sequence, which is called a read sequence. These read sequences are designed to have no interaction with endogenous mRNA sequences or each other. For instance, for every position with the value “1” in the codeword of an individual mRNA, the corresponding read sequence is attached to the primary targeting sequences against that mRNA. In general, each probe will contain a target sequence and one or more read sequences. If the total length of the necessary read sequences and the primary target sequence exceeds synthesis capabilities, then subsets of the read sequences can be appended to distinct target sequences. For example, consider the potential codeword 11001010 for actin. Probe sequences for this RNA could contain the read sequences corresponding to positions 1, 2, 5, and 7 in the codeword attached to a variety of target sequences specific to actin. After all the sequences have been designed, the resulting complex set of unique custom oligonucleotide (e.g. DNA) sequences is manufactured and amplified using methods as described below.
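The translation from codeword to probe layout can be sketched as follows. This is an illustrative sketch only: the bracketed tokens stand in for real nucleotide sequences, and the function names and the `reads_per_probe` limit are assumptions introduced here, not part of the described method.

```python
from itertools import cycle


def read_positions(codeword: str) -> list[int]:
    """1-based positions of the codeword that carry the value '1'."""
    return [i for i, bit in enumerate(codeword, start=1) if bit == "1"]


def build_probes(targets, codeword, read_seqs, reads_per_probe):
    """Attach subsets of the required read sequences to distinct target sequences.

    targets: target-sequence placeholders for one mRNA
    read_seqs: mapping from codeword position to read-sequence placeholder
    reads_per_probe: hypothetical cap on read sequences per probe
    """
    positions = read_positions(codeword)
    # Split the required positions into subsets that fit on a single probe.
    subsets = [positions[i:i + reads_per_probe]
               for i in range(0, len(positions), reads_per_probe)]
    probes = []
    # Cycle through the subsets so every subset is carried by some probes.
    for target, subset in zip(targets, cycle(subsets)):
        tail = "".join(read_seqs[p] for p in subset)
        probes.append((target + tail, subset))
    return probes


# Placeholder tokens, not real sequences.
read_seqs = {p: f"[R{p}]" for p in range(1, 9)}
actin_targets = ["[actin_T1]", "[actin_T2]", "[actin_T3]"]

for probe, subset in build_probes(actin_targets, "11001010", read_seqs, 2):
    print(probe, subset)
```

For the actin codeword 11001010, `read_positions` returns positions 1, 2, 5, and 7, matching the example in the text; with a cap of two read sequences per probe, the target sequences alternate between carrying reads 1 and 2 and reads 5 and 7.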

In the third step, the resulting pool of DNA is hybridized, e.g., to fixed, permeabilized cells. In this process, individual probes may be attached to every RNA in the cell by hybridization of their corresponding target sequences with the RNA while the read sequences remain free to bind the appropriate secondary probes as discussed below.

In the fourth step, the read-out step, fluorescently labeled secondary nucleic acid probes (also called readout probes) are successively hybridized to the read sequences attached to the target sequences that bind to the mRNA targets in the above step. When simultaneously imaging a large number of different RNA species in cells, the density of labeled RNAs may exceed that at which each RNA can be resolved via conventional imaging methods. Thus, this may be performed using a super-resolution imaging method, for example STORM (stochastic optical reconstruction microscopy), to resolve the labeled molecules. After each round of hybridization and imaging with the secondary probes, the fluorophores are quenched or otherwise inactivated via chemical or optical techniques such as oxidation, chemical bleaching, photobleaching, stringent washing or enzymatic digestion, etc. The sample is then stained with the next secondary probe, and the cycle continues until all positions of the codewords have been read out. In the simplest incarnation, there will be one hybridization step for each position within the codeword, e.g. 8 hybridization steps for an 8-letter codeword (FIGS. 1A-1C).

FIGS. 1A-1C show schematic diagrams of this example. FIG. 1A shows that every position of the codewords is assigned a unique oligonucleotide sequence when this position has a value “1.” All mRNA codewords are then translated into combinations of read sequences, which are attached to the targeting sequence. FIG. 1B shows various steps of the labeling scheme of this example. In the first step, all mRNAs (I-III) are tagged with multiple oligonucleotide (e.g. ssDNA) probes comprising a primary targeting sequence, which hybridizes to the RNA of interest, and a “tail” (i.e. containing read sequences) carrying the translated codeword, which does not interact with endogenous nucleotide sequences. In the next step, the first secondary probe is added, which can bind all probes whose tails have a read sequence corresponding to the value of “1” in the first position. The dyes on these secondary probes are imaged and bleached, then the next secondary probe is added to bind probes attached to mRNA which have a value of “1” in the second position of their assigned codeword, and so on.

In the final step, the microscopy images from each staining and imaging round are aligned, for example, computationally (e.g. using fiducial beads or other markers tracked during image acquisition), and the clusters of localizations resolved by conventional fluorescence microscopy or super-resolution imaging (e.g. STORM) from the different rounds are identified. These clusters of localizations arise from individual target mRNA molecules, and the hybridization rounds in which a spot was detected in a given cluster correspond to the “1”s in the codeword for that mRNA. If there are no missed-detection events or false positive signals in the images, this codeword will perfectly match one of the expected codewords. FIGS. 1A-1C describe an example in which the codeword has three letters, i.e. three positions, and the three target mRNAs have codewords 110, 101, and 011 assigned to them. In real experimental examples, the codeword could contain more digits. For example, the mRNA for actin can be assigned the codeword 11001010. In that case, detected clusters containing overlapping localization signals in the 1st, 2nd, 5th and 7th hybridization steps (meaning the 1st, 2nd, 5th and 7th secondary probes bound to this site) can be identified as individual actin mRNA molecules, since the pattern of positive bindings matches the codeword of actin (11001010). In addition, if there are missed-detection events or false positive signals in the image data, these aberrations can be corrected by the implemented error-correction scheme. For example, clusters of localizations with a detected codeword that has only a one-digit discrepancy from 11001010 (such as 11000010 or 11101010) can also be identified as actin mRNA, since all other valid codewords in this example differ from the detected pattern in two or more positions.
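The decoding logic described above can be sketched as a nearest-codeword lookup. In this illustrative sketch, only the actin codeword and the two single-error variants come from the text; the other codebook entries, the names `mRNA_B` and `mRNA_C`, and the function names are hypothetical, and image registration and cluster detection are assumed to have already produced the set of rounds in which each cluster was seen.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length codewords differ."""
    return sum(x != y for x, y in zip(a, b))


def rounds_to_codeword(rounds_detected: set[int], n_rounds: int) -> str:
    """Turn the set of 1-based rounds in which a cluster was seen into a word."""
    return "".join("1" if r in rounds_detected else "0"
                   for r in range(1, n_rounds + 1))


def decode(detected: str, codebook: dict[str, str], max_errors: int = 1):
    """Return the unique mRNA within max_errors of the detected word, else None."""
    matches = [name for name, word in codebook.items()
               if hamming_distance(detected, word) <= max_errors]
    return matches[0] if len(matches) == 1 else None


# Hypothetical codebook with minimum pairwise Hamming distance 4.
codebook = {"actin": "11001010", "mRNA_B": "00110110", "mRNA_C": "10100101"}

perfect = rounds_to_codeword({1, 2, 5, 7}, 8)   # "11001010"
print(decode(perfect, codebook))      # exact match: actin
print(decode("11000010", codebook))   # one missed detection: actin
print(decode("11101010", codebook))   # one false positive: actin
print(decode("11000000", codebook))   # two errors: None (detected, not corrected)
```

Words with two errors fall outside the radius-1 neighborhood of every codeword in this codebook and are rejected rather than assigned, which corresponds to detecting, but not correcting, the error.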

Example 2

This example describes an alternative approach that differs in several of the steps described above. This approach begins with the first step, construction of the codewords for the desired mRNA targets, as described above.

In the second step of this approach, nucleic acid probes are designed that bind uniquely to the mRNA targets of interest, as described above. However, instead of appending unique read sequences to these targeting sequences, unique pools or groups of probes are constructed from these target sequences. Each pool comprises all or a subset of the sequences that target all mRNAs which contain the same value at a given position in their codeword. For example, the first pool would consist of all or a subset of the target sequences designed for all mRNAs that contain a 1 in the first position of their codewords, e.g. 110 and 101 but not 011; the second pool would consist of all or a subset of the target sequences designed for all mRNAs that contain a 1 in the second position of their codewords, e.g. 110 and 011 but not 101; the third pool would consist of all or a subset of the target sequences designed for all mRNAs that contain a 1 in the third position of their codewords, e.g. 011 and 101 but not 110 (FIG. 1C). As another example, consider the potential codeword 11001010 for actin. Probes that target this mRNA would be included in pools 1, 2, 5, and 7 but not in pools 3, 4, 6, and 8. The same target sequence for a given mRNA may or may not be included in more than one pool. For example, a probe that targets the same region of actin may be included in pools 1, 2, 5, and 7 or any subset of these pools. After all of the pools have been designed, each complex set of unique, custom oligonucleotide sequences is manufactured and amplified using methods as described below.
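The pool construction can be sketched as grouping mRNAs by the positions at which their codewords contain a 1. This is an illustrative sketch: the labels I, II, and III follow the three-mRNA example of FIG. 1C, while the function name and the use of mRNA labels in place of actual target sequences are simplifying assumptions.

```python
def build_pools(codebook: dict[str, str]) -> dict[int, list[str]]:
    """Map each 1-based codeword position to the mRNAs with a '1' at that position."""
    n_positions = len(next(iter(codebook.values())))
    pools = {pos: [] for pos in range(1, n_positions + 1)}
    for name, word in codebook.items():
        for pos, bit in enumerate(word, start=1):
            if bit == "1":
                # In practice, all or a subset of this mRNA's target sequences
                # would be added here; the label stands in for them.
                pools[pos].append(name)
    return pools


three_letter = {"I": "110", "II": "101", "III": "011"}
print(build_pools(three_letter))
# prints: {1: ['I', 'II'], 2: ['I', 'III'], 3: ['II', 'III']}

actin_pools = [pos for pos, names in build_pools({"actin": "11001010"}).items()
               if "actin" in names]
print(actin_pools)  # prints: [1, 2, 5, 7]
```

Each pool is then hybridized and imaged in its own round, so the rounds in which a given mRNA lights up reproduce its codeword.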

In the third step of this approach, the first pool of probes is hybridized, e.g., to fixed, permeabilized cells. In this process, the fluorophores attached to each of the probes in this pool are bound to each of the targets of that pool. The binding of these probes is then determined by fluorescence microscopy. As described above, these images can be collected via a range of methods, including conventional fluorescence imaging or super-resolution imaging methods such as STORM. After a round of imaging, the probes from the first pool are inactivated or removed from the sample via the methods described above. This process is then repeated for each successive pool of probes until some or all of the pools have been applied to the sample and imaged such that all positions in the codewords have been read out. In the simplest incarnation, there will be one hybridization and imaging step for each position in the codeword, e.g., 3 rounds of hybridization and imaging for a codeword with 3 positions (FIG. 1C) or 8 rounds of hybridization and imaging for a codeword with 8 positions.

The final step of this approach is identical to that described above.

Example 3

In this example, 14 genes (PGK1, H3F3B, PKM, ENO1, GPI, EEF2, GNAS, HSPA8, GAPDH, CALM1, RHOA, PPIA, UBA52, and VCP) were encoded using a subset of the (8,4) SECDED code (FIGS. 2A-2E). To determine the accuracy of these measurements, the measured abundances of these 14 mRNAs were compared to the abundances measured from bulk RNA-seq of A549 cells (published ENCODE data). Remarkably, it was found that there was excellent agreement between these two measurements, as the transcript counts measured using the sequential hybridization approach correlated with gene expression measured using RNA-seq with a Pearson correlation coefficient r of 0.75 (FIG. 2F). Gene expression from 3 other cells was also measured, and it was found that the gene expression of these 14 genes was highly correlated among the cells with an r of 0.96 (FIG. 2G).

Codebook Design. Each mRNA in the target set was assigned a binary codeword using a Single Error Correction Double Error Detection (SECDED) code. SECDED is an extended Hamming codebook with an additional parity bit. Briefly, MATLAB's Communications System Toolbox was used to generate SECDED codes of either 8 or 16 letters or positions. In both cases, only those codewords containing four 1s were used. These words were assigned at random to mRNAs in the target set. [0 1 0 1 1 1 0 0] is an example of the 8-letter codewords used (i.e., these codewords each contained four 1s and four 0s). [0 1 0 1 1 1 0 0 0 0 0 0 0 0 0] is an example of the 16-letter codewords used (i.e., each codeword contained four 1s and twelve 0s). Not every codeword was necessarily assigned to an mRNA.
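As an illustration of how such a codebook can be enumerated, the following Python sketch builds an 8-letter extended Hamming (SECDED) code and keeps the words containing four 1s. It does not reproduce the MATLAB toolbox call used in the example; the parity matrix `P` is one common systematic choice and is an assumption, so the specific words it produces need not coincide with the ones actually used.

```python
from itertools import product

# Parity part of a systematic Hamming(7,4) generator matrix (assumed choice).
P = [(1, 1, 0), (1, 0, 1), (0, 1, 1), (1, 1, 1)]

def encode_secded(data):
    """Encode 4 data bits as 4 data + 3 parity + 1 overall-parity bit."""
    parity = [sum(d * row[j] for d, row in zip(data, P)) % 2 for j in range(3)]
    word = list(data) + parity
    word.append(sum(word) % 2)  # extra parity bit gives double-error detection
    return tuple(word)

def hamming(a, b):
    """Number of positions at which two words differ."""
    return sum(x != y for x, y in zip(a, b))

all_words = [encode_secded(d) for d in product((0, 1), repeat=4)]
weight4 = [w for w in all_words if sum(w) == 4]
min_dist = min(hamming(a, b)
               for i, a in enumerate(weight4) for b in weight4[i + 1:])

print(len(all_words), len(weight4), min_dist)
```

Under this construction the 16-word code contains 14 words with exactly four 1s, any two of which differ in at least four positions, which is consistent with encoding 14 genes using a subset of the (8,4) code.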

Computational Assembly of ssDNA Primary Probe Sequence. The number of primary nucleic acid probes used for hybridization with mRNA targets ranged from 200 to 2000 unique oligonucleotides, depending on the experiment. For example, to label 14 mRNAs with 28 oligos targeting each gene, 392 unique sequences were used. Large numbers of oligos with unique sequences were purchased in a pool from LC Sciences or CustomArray. However, array-synthesized oligos were supplied in minute quantities that were insufficient for in situ hybridization. The protocol for their amplification is described below.

Each primary probe contained three components: flanking primer sequences to allow enzymatic amplification of probes, a targeting sequence for in situ hybridization to mRNAs, and a secondary tag sequence containing one or more read sequences for sequential readout of codewords.

The following is an example of a primary probe:

(SEQ ID NO: 1)
GTTGGCGACGAAAGCACTGCGATTGGAACCGTCCCAAGCGTTGCGCTTAATGGATCATCAATTTTGTCTCACTACGACGGTCAATCGCGCTGCATACTTGCGTCGGTCGGACAAACGAGG

The components are arranged in the following order: forward primer (not underlined), secondary read sequence 1 (underlined), mRNA targeting sequence (not underlined), secondary read sequence 2 (underlined), and reverse primer (not underlined). The secondary read sequences are the reverse complement of the corresponding secondary probes. Since only codewords that contained four ‘1’s were used, the primary probes for each mRNA needed to contain 4 different secondary read sequences in this example. However, in order to reduce the overall length of the primary probes, the pool of targeting sequences for each mRNA target was split at random into two pools. Two secondary read sequences are attached to each probe in one of the two pools and the other two secondary read sequences are attached to the probes in the other pool. The design criteria for each component are described below.

Primer Design. Specific index primers were generated from a collection of 240,000 published orthogonal 25-bp sequences. These sequences were trimmed to 20 bp and selected for a narrow 70 to 80° C. melting temperature, the absence of consecutive repeats of 3 or more bases, and the presence of a GC clamp, i.e., one of the two 3′ terminal bases must be G or C. To further improve specificity, these sequences were then screened against the human genome using BLAST+ (Camacho et al., 2009), and primers with 14 or more contiguous bases of homology were eliminated. In a subsequent screening via BLAST+, primers that shared 11 or more contiguous bases or more than 5 bases at the 3′ end with any other primer or the T7 promoter were also removed.
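Two of the sequence-level criteria above, the GC clamp and the repeat filter, are simple enough to sketch directly. The Python below is a hedged illustration only: it interprets "consecutive repeats" as a run of identical bases, which is an assumption, leaves the run length as a parameter, and omits the melting-temperature and BLAST+ screens. The function names and the test sequences are hypothetical.

```python
import re

def has_gc_clamp(seq):
    """True if at least one of the two 3' terminal bases is G or C."""
    return any(base in "GC" for base in seq[-2:])

def has_base_run(seq, min_run=3):
    """True if any single base occurs min_run or more times in a row.

    Treating 'consecutive repeats' as a homopolymer run is an assumption.
    """
    return re.search(r"(.)\1{%d,}" % (min_run - 1), seq) is not None

def passes_basic_filters(seq, length=20, min_run=3):
    """Apply the length, GC-clamp, and repeat checks (no Tm or BLAST screen)."""
    return (len(seq) == length
            and has_gc_clamp(seq)
            and not has_base_run(seq, min_run))

# Made-up candidate sequences, not taken from the published primer set.
for candidate in ("ACGTACGTACGTACGTACGC",   # passes
                  "ACGTACGTACGTACGTAAAC",   # run of three A's
                  "ACGTACGTACGTACGTACAT"):  # no G or C in the last two bases
    print(candidate, passes_basic_filters(candidate))
```

Sequences that survive these checks would still need the melting-temperature window and the homology screens described in the text.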

Secondary Probes Design. 30-bp long secondary probe sequences were created by concatenating fragments of the orthogonal primer set described above. These secondaries were then screened for orthogonality with other secondaries (no more than 11 basepairs of homology) and for potential off-target binding sites in the human genome (no more than 14 basepairs of homology). Secondary sequences used in this example are provided in Table 1.

TABLE 1

Bit    Secondary sequence                Sequence number
B1     CGCAACGCTTGGGACGGTTCCAATCGGATC    SEQ ID NO: 2
B2     CGAATGCTCTGGCCTCGAACGAACGATAGC    SEQ ID NO: 3
B3     ACAAATCCGACCAGATCGGACGATCATGGG    SEQ ID NO: 4
B4     CAAGTATGCAGCGCGATTGACCGTCTCGTT    SEQ ID NO: 5
B5     TGCGTCGTCTGGCTAGCACGGCACGCAAAT    SEQ ID NO: 6
B6     AAGTCGTACGCCGATGCGCAGCAATTCACT    SEQ ID NO: 7
B7     CGAAACATCGGCCACGGTCCCGTTGAACTT    SEQ ID NO: 8
B8     ACGAATCCACCGTCCAGCGCGTCAAACAGA    SEQ ID NO: 9
B9     CGCGAAATCCCCGTAACGAGCGTCCCTTGC    SEQ ID NO: 10
B10    GCATGAGTTGCCTGGCGTTGCGACGACTAA    SEQ ID NO: 11
B11    CCGTCGTCTCCGGTCCACCGTTGCGCTTAC    SEQ ID NO: 12
B12    GGCCAATGGCCCAGGTCCGTCACGCAATTT    SEQ ID NO: 13
B13    TTGATCGAATCGGAGCGTAGCGGAATCTGC    SEQ ID NO: 14
B14    CGCGCGGATCCGCTTGTCGGGAACGGATAC    SEQ ID NO: 15
B15    GCCTCGATTACGACGGATGTAATTCGGCCG    SEQ ID NO: 16
B16    GCCCGTATTCCCGCTTGCGAGTAGGGCAAT    SEQ ID NO: 17

mRNA Targeting Sequence Design. To determine the relative abundance of all the isoforms of all genes expressed in these cell lines, transcriptome profiling data from the ENCODE project for total RNA from A549 and IMR90 cells was processed using the publicly available software Cufflinks, along with human genome annotations from GENCODE v18. Gene models corresponding to the most highly expressed isoform were used to build a sequence library in FASTA format recording the dominant isoform of every gene. Genes of interest were selected from this library. These genes were partitioned into 1 kb segments, then the software OligoArray2.1 was used to generate primary probe sequences for the human transcriptome with the following constraints: 30-bp or 40-bp length, depending on the experiment; probe-target melting temperatures greater than 70° C. (variable parameter); no cross-hybridization targets with melting temperatures greater than 72° C. (variable parameter); no predicted internal secondary structures with melting temperatures greater than 76° C. (variable parameter); and no single-nucleotide contiguous repeats of 6 or more bases. After OligoArray probe selection, all potential probes that mapped to a different gene were rejected while all potential probes with multiple alignments to the same gene were retained. A BLAST database was assembled from the FASTA library of all expressed genes to screen for probes' uniqueness. For each gene, 14 to 28 targeting sequences produced during the OligoArray processing were selected.

Probe Synthesis: index PCR. The templates for specific probe sets were selected from the complex oligopool via limited-cycle PCR. Briefly, 0.5 to 1 ng of the complex oligopool was combined with 0.5 micromolar of each primer. The forward primer matched the priming sequence for the desired subset while the reverse primer was a 5′ concatenation of this sequence with a T7 promoter. To avoid the generation of G-quadruplets, which can be difficult to synthesize, the terminal Gs required in the T7 promoter were generated from Gs located at the 5′ end of the priming region where appropriate. All primers were synthesized by IDT. A 50 microliter reaction volume was amplified either using the KAPA real-time library amplification kit (KAPA Biosystems; KK2701) or via a homemade qPCR mix which included 0.8× EvaGreen (Biotium; 31000-T) and the hot-start Phusion polymerase (New England Biolabs; M0535S). Amplification was followed in real time using Agilent's Mx3000P or Biorad's CFX Connect. Individual samples were removed immediately before the plateau in amplification to minimize distortion of template abundance due to over-amplification. Individual templates were purified with columns according to the manufacturer's instructions (Zymo DNA Clean and Concentrator; D4003) and eluted in RNase-free deionized water.

Amplification via in vitro transcription. The template was then amplified via in vitro transcription. Briefly, 0.5 to 1 microgram of template DNA was amplified into 100-200 micrograms of RNA in a single 20-30 microliter reaction with a high-yield RNA polymerase (New England Biolabs; E2040S). Reactions were supplemented with 1× RNase inhibitor (Promega RNasin; N2611). Amplification was typically run for 4 to 16 hours at 37° C. to maximize the yield. The RNA was not purified after the reaction and was either stored at −80° C. or immediately converted into DNA as described below.

Reverse Transcription. 1 to 2 nmol of fluorescently labeled ssDNA probe was created from the above in vitro transcription reactions using the reverse transcriptase Maxima H- (Thermo Scientific; EP0751). This enzyme was used because of its higher processivity and temperature resistance, which allowed the conversion of large quantities of RNA into DNA within small volumes at temperatures that disfavor secondary structure formation. The unpurified RNA created above was supplemented with 1.6 mM of each dNTP, 1-2 nmol of fluorescently labeled forward primer, 300 units of Maxima H-, 60 units of RNasin, and a final 1× concentration of the Maxima RT buffer. The final 75 microliter volume was incubated at 50° C. for 60 minutes.

Strand Selection and Purification. The template RNA in the reaction above was then removed from the DNA via alkaline hydrolysis. 75 microliters of 0.25 M EDTA and 0.5 N NaOH were added to each reverse transcription reaction, and the sample was incubated at 95° C. for 10 minutes. The reaction was immediately neutralized by purifying the ssDNA probe with a modified version of the Zymo Oligo Clean and Concentrator protocol. Specifically, the 5-microgram capacity column was replaced with a 25-microgram or 100-microgram capacity DNA column as appropriate. The remainder of the protocol was run according to the manufacturer's instructions. Probe was eluted in 100 microliters of RNase-free deionized water and evaporated in a vacuum concentrator. The final pellet was resuspended in 10 microliters of RNase-free water and stored at −20° C. Denaturing polyacrylamide gel electrophoresis and absorption spectroscopy revealed that this protocol typically produced 90-100% incorporation of the fluorescent primer into full-length probe and 75-90% recovery of the total fluorescent probe. Thus, without exceeding a 150-microliter reaction volume, this protocol can be used to create ~2 nmol of fluorescent probe.

Cell culture and fixation. A549 and IMR90 cells (American Type Culture Collection) were cultured with Dulbecco's Modified Eagle Medium and Eagle's Minimum Essential Medium, respectively. Cells were incubated at 37° C. with 5% CO₂ for 36-48 hours. Cells were fixed in 3% paraformaldehyde (Electron Microscopy Sciences) in PBS for 15 minutes, washed with PBS, and permeabilized in 70% ethanol overnight at 4° C.

Fluorescence In Situ Hybridization (FISH): primary (encoding) probes. Cells were hydrated in wash buffer (2×SSC, 50% formamide) for 10 minutes, labeled with primary oligos (0.5 nM per sequence) in hybridization buffer (2×SSC, 50% formamide, 1 mg/mL yeast tRNA, and 10% dextran sulfate) overnight at 37° C., washed twice with wash buffer at 47° C. for 10 minutes, and washed twice with 2×SSC. Fluorescent fiducial beads (Molecular Probes, F-8809) were added at a 1:10,000 dilution in 2×SSC before imaging.

Secondary probes. Secondary (readout) probes (10 nM) were hybridized in secondary hybridization buffer (2×SSC, 20% formamide, and 10% dextran sulfate) to their primary targets for 30 minutes at 37° C. Cells remained on the microscope stage during the hybridization. An objective heater was used to maintain the temperature at 37° C. Cells were washed with secondary wash buffer (2×SSC, 20% formamide).

Fluidics and STORM Imaging. Multiple rounds of sequential labeling, washing, imaging, and bleaching were performed on an automated platform consisting of a fluidics setup and a STORM (stochastic optical reconstruction microscopy) microscope. The fluidics setup included a flow chamber (Bioptechs FCS2), a peristaltic pump (Rainin Dynamax RP-1), and three computer-controlled 8-way valves (Hamilton MVP and Hamilton HVXM 8-5). This system allowed the automated integration of STORM movie collection and secondary hybridization.

The imaging buffer included 50 mM Tris (pH 8), 10% (w/v) glucose, 1% βME (2-mercaptoethanol) or 25 mM MEA, with or without 2 mM 1,5-cyclooctadiene, and an oxygen scavenging system (0.5 mg/ml glucose oxidase (Sigma-Aldrich) and 40 microgram/ml catalase (Sigma-Aldrich)). A layer of mineral oil was used to seal the imaging buffer, preventing its acidification over the course of multiple hybridizations.

The STORM setup included an Olympus IX-71 inverted microscope configured for oblique incidence excitation. The samples were continuously illuminated with a 642-nm diode-pumped solid-state laser (VFL-P500-642; MPB Communications). A 405-nm solid-state laser (Cube 405-100C; Coherent) was used for activation of dyes. Fluorescence was collected using an Olympus (UPlanSApo 100×, 1.4 NA) objective lens and passed through a custom dichroic, as well as a quad-view beam splitter. All movies were recorded using an EMCCD camera (Andor iXon 897), imaging at 60 Hz. The 512×256 field of view of the camera was split into separate 256×256 pixel movies prior to saving. The left half of this field of view contained the STORM data and the right half contained images of the fluorescent fiducial beads. These latter movies were downsampled to 1 Hz prior to saving. During data acquisition, a home-built focus lock was used to maintain a constant focal plane. STORM movies included 20,000 to 30,000 frames in STORM buffer while the bleach movies included 10,000 frames in wash buffer.

Image Analysis: analysis of single-molecule localizations. Movies of single-molecule localizations and fluorescent fiducial beads were processed separately using previously published single-emitter localization software.

Image Registration. The starting positions of the beads from each round of hybridization were used to align movies from each round. The 2D autocorrelation between bead images of consecutive hybridizations, followed by nearest-neighbor matching, was used to match beads between images. The pair of beads with the most similar displacement vectors was used to compute a rigid translation-rotation warp to align the beads. This alignment method is robust to samples in which multiple fiducials are displaced or become detached and reattach during imaging.
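One way to picture the rigid translation-rotation warp is to compute it from a single matched pair of beads. The Python below is a minimal sketch under stated assumptions: the two beads are already matched between rounds, coordinates are plain (x, y) tuples, and the function names are hypothetical. It is not the registration code used in the example and omits the correlation and nearest-neighbor matching steps.

```python
import math

def rigid_from_two_beads(ref_a, ref_b, mov_a, mov_b):
    """Return (theta, tx, ty) mapping moving-round coordinates onto the reference.

    theta rotates the moving bead pair parallel to the reference pair;
    (tx, ty) then moves the rotated midpoint onto the reference midpoint.
    """
    ang_ref = math.atan2(ref_b[1] - ref_a[1], ref_b[0] - ref_a[0])
    ang_mov = math.atan2(mov_b[1] - mov_a[1], mov_b[0] - mov_a[0])
    theta = ang_ref - ang_mov
    c, s = math.cos(theta), math.sin(theta)
    mid_ref = ((ref_a[0] + ref_b[0]) / 2, (ref_a[1] + ref_b[1]) / 2)
    mid_mov = ((mov_a[0] + mov_b[0]) / 2, (mov_a[1] + mov_b[1]) / 2)
    tx = mid_ref[0] - (c * mid_mov[0] - s * mid_mov[1])
    ty = mid_ref[1] - (s * mid_mov[0] + c * mid_mov[1])
    return theta, tx, ty

def apply_rigid(point, theta, tx, ty):
    """Rotate a point by theta about the origin, then translate by (tx, ty)."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * point[0] - s * point[1] + tx,
            s * point[0] + c * point[1] + ty)

# Made-up bead positions: the second round is rotated and shifted.
theta, tx, ty = rigid_from_two_beads((0, 0), (10, 0), (5, 5), (5, 15))
print(apply_rigid((5, 5), theta, tx, ty), apply_rigid((5, 15), theta, tx, ty))
```

The same transform would then be applied to every localization from the moving round before clusters are compared across rounds.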

Drift Correction. Drift during image acquisition was corrected using the trajectories of the fiducial beads (recorded at 1 Hz). Bead positions were linked in each frame. The trajectory of the two beads that moved in the most correlated fashion was taken as the drift trajectory.

mRNA Cluster Calling. Localizations were first screened to be above a threshold number of photons (generally 2000) and required to be within 32 nm of 5 other localizations (parameters may be tuned). The remaining molecule localizations were binned in a 2D histogram of 10×10 nm bins (bin size is a variable parameter). All connected bins were taken to be part of a cluster (diagonal contacts are classified as connected). Clusters were required to have more than 80 total localizations across all hybridizations (variable parameter) to be called an mRNA cluster. The weighted centroids of these clusters from the 2D histogram were recorded as the mRNA positions.
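The binning and connected-bin grouping described above can be sketched as follows. This is a simplified illustration in plain Python: it assumes localizations are (x, y) positions in nanometers that have already passed the photon and neighbor screens, and the function name and the toy data are hypothetical.

```python
from collections import deque

def call_clusters(localizations, bin_nm=10.0, min_locs=80):
    """Group (x_nm, y_nm) localizations into clusters of 8-connected bins.

    Returns (centroid_x, centroid_y, n_localizations) for each cluster with
    more than min_locs localizations; centroids are count-weighted bin centers.
    """
    counts = {}
    for x, y in localizations:
        key = (int(x // bin_nm), int(y // bin_nm))
        counts[key] = counts.get(key, 0) + 1

    seen, clusters = set(), []
    for start in counts:
        if start in seen:
            continue
        seen.add(start)
        queue, members = deque([start]), []
        while queue:  # breadth-first search over occupied neighboring bins
            bx, by = queue.popleft()
            members.append((bx, by))
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):  # diagonal contacts count as connected
                    nb = (bx + dx, by + dy)
                    if nb in counts and nb not in seen:
                        seen.add(nb)
                        queue.append(nb)
        total = sum(counts[m] for m in members)
        if total > min_locs:
            cx = sum((m[0] + 0.5) * bin_nm * counts[m] for m in members) / total
            cy = sum((m[1] + 0.5) * bin_nm * counts[m] for m in members) / total
            clusters.append((cx, cy, total))
    return clusters

# Toy data: two diagonally touching bins form one cluster; a sparse spot is dropped.
locs = [(5.0, 5.0)] * 100 + [(15.0, 15.0)] * 100 + [(105.0, 105.0)] * 50
print(call_clusters(locs))
```

The bin size and the 80-localization cutoff are exposed as arguments because the text identifies both as variable parameters.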

A given cluster is recorded to be represented in an individual hybridization round if more than 9 localizations (variable parameter) are found within a 48 nm radius (variable parameter) of the centroid for that mRNA in that hybridization round.
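The per-round test described above can be expressed as a short thresholding step. A minimal sketch, assuming each round's localizations are given as a list of (x, y) positions in nanometers; the function name and the toy data are hypothetical.

```python
import math

def call_bits(centroid, locs_by_round, radius_nm=48.0, min_locs=9):
    """Return a binary word with '1' for each round that has more than
    min_locs localizations within radius_nm of the cluster centroid."""
    bits = []
    for locs in locs_by_round:
        n_near = sum(1 for x, y in locs
                     if math.hypot(x - centroid[0], y - centroid[1]) <= radius_nm)
        bits.append("1" if n_near > min_locs else "0")
    return "".join(bits)

# Toy data for four rounds around a centroid at the origin.
rounds = [
    [(0.0, 0.0)] * 10,       # 10 nearby localizations: counted
    [(0.0, 0.0)] * 9,        # 9 nearby localizations: not more than 9
    [(500.0, 500.0)] * 20,   # many localizations, but too far away
    [(10.0, 10.0)] * 12,     # 12 localizations within the radius
]
print(call_bits((0.0, 0.0), rounds))
```

The resulting word is what would then be compared against the codebook.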

Cluster Decoding. For each mRNA cluster, a codeword is read out, including “0”s for all the hybridization rounds in which fewer than the threshold number of localizations are found near the centroid and “1”s for the rounds in which more than the threshold number of localizations are counted. The SECDED codebook decoded these as either perfect matches to target mRNA codewords, correctable errors which can be unambiguously mapped back to a target mRNA, or uncorrectable errors, which differed by two or more letters from the words in the codebook.

FIG. 2A shows a STORM image of a cell. FIG. 2B shows a zoom-in of the boxed region in FIG. 2A. Each dot indicates a localization. Localizations from different rounds of imaging are shown differently. FIG. 2C shows a representative cluster of localizations from the boxed region in FIG. 2B. The cluster shows localization signals from 4 different hybridizations. This cluster is a putative mRNA encoded with codeword [0 1 0 1 1 1 0 0]. FIG. 2D shows a reconstructed cell image of 14 genes after decoding and error correction. Each gene is shown differently. FIG. 2E shows measured gene expression for the 14 genes from the cell. FIG. 2F shows a comparison of transcript count with ensemble RNA-sequencing data. FIG. 2G shows correlation of transcript expression level between two cells detected using the described approach.

Example 4

The following examples are generally directed to multiplexed single-molecule imaging with error-robust encoding allowing for simultaneous measurements of thousands of RNA species in single cells. In general, knowledge of the expression profile and spatial landscape of RNAs in individual cells is essential for understanding the rich repertoire of cellular behaviors. The following examples report various techniques directed to single-molecule imaging approaches that allow the copy numbers and spatial localizations of thousands of RNA species to be determined in single cells. Some of these techniques are called Multiplexed Error-Robust Fluorescence in Situ Hybridization or “MERFISH.”

Using error-robust encoding schemes to combat single-molecule labeling and detection errors, these examples demonstrated the imaging of hundreds to thousands of unique RNA species in hundreds of individual cells. Correlation analysis of the ~10⁴ to ~10⁶ pairs of genes allowed constraints on gene regulatory networks, prediction of novel functions for many unannotated genes, and identification of distinct spatial distribution patterns of RNAs that correlate with properties of the encoded proteins.

System-wide analyses of the abundance and spatial organization of RNAs in single cells promise to transform understanding in many areas of cell and developmental biology, such as the mechanisms of gene regulation, the heterogeneous behavior of cells, and the development and maintenance of cell fate. Single-molecule fluorescence in situ hybridization (smFISH) has emerged as a powerful tool for studying the copy number and spatial organization of RNAs in single cells either in isolation or in their native tissue context. Taking advantage of its ability to map the spatial distributions of specific RNAs with high resolution, smFISH has revealed the importance of subcellular RNA localization in diverse processes such as cell migration, development, and polarization. In parallel, the ability of smFISH to precisely measure the copy numbers of specific RNAs without amplification bias has allowed quantitative measurement of the natural fluctuations in gene expression, which has in turn elucidated the regulatory mechanisms that shape such fluctuations and their role in a variety of biological processes.

However, application of the smFISH approach to many systems-level questions remains limited by the number of RNA species that can be simultaneously measured in single cells. State-of-the-art efforts using combinatorial labeling by either color-based barcodes or sequential hybridization have enabled simultaneous measurements of 10-30 different RNA species in individual cells, yet many interesting biological questions would benefit from the measurement of hundreds to thousands of RNAs within a single cell, which is not achievable using such techniques. For example, analysis of how the expression profiles of such a large number of RNAs vary from cell to cell and how these variations correlate among different genes could be used to systematically identify co-regulated genes and map regulatory networks; knowledge of the subcellular organizations of numerous RNAs and their correlations could help elucidate molecular mechanisms underlying the establishment and maintenance of many local cellular structures; and RNA profiling of individual cells in native tissues could allow in situ identification of cell type.

The following examples generally discuss certain techniques called MERFISH, which are highly multiplexed smFISH imaging methods that substantially increase the number of RNA species that can be simultaneously imaged in single cells by using combinatorial labeling and sequential imaging with error-robust encoding schemes. These examples demonstrate this multiplexed imaging approach by simultaneously measuring 140 RNA species using an encoding scheme that can both detect and correct errors and 1001 RNA species using an encoding scheme that can detect but not correct errors. It should be understood that these numbers are by way of exemplification only, not limitation. Correlation analyses of the copy number variations and spatial distributions of these genes allowed us to identify groups of genes that are co-regulated and groups of genes that share similar spatial distribution patterns inside the cell.

Combinatorial labeling with error-robust encoding schemes. Combinatorial labeling that identifies each RNA species by multiple (N) distinct signals offers a route to rapidly increase the number of RNA species that can be probed simultaneously in individual cells (FIG. 5A). However, this approach to scaling up the throughput of smFISH to the systems scale faces a significant challenge because not only does the number of addressable RNA species increase exponentially with N, but the detection error rates also grow exponentially with N (FIGS. 5B-5D). Imagine a conceptually simple scheme to implement combinatorial labeling, where each RNA species is encoded with an N-bit binary word and the sample is probed with N corresponding rounds of hybridization, each round targeting only the subset of RNAs that should read ‘1’ in the corresponding bit (FIG. 11). N rounds of hybridization would allow 2^(N)−1 RNA species to be probed. With just 16 hybridizations, over 64,000 RNA species, which should cover the entire human transcriptome including both messenger RNAs (mRNAs) and non-coding RNAs, could be identified (FIG. 5B; upper symbols). However, as N increases, the fraction of RNAs properly detected (the calling rate) would rapidly decrease and, more troublingly, the fraction of RNAs that are identified as incorrect species (the misidentification rate) would rapidly increase (FIG. 5C, lower symbols; FIG. 5D, upper symbols). With realistic error rates per hybridization (measured below), the majority of RNA molecules would be misidentified after 16 rounds of hybridizations!
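The scaling argument above can be made concrete with a short calculation. The Python sketch below assumes that bit errors are independent and uses per-bit error rates of 10% for 1-->0 errors and 4% for 0-->1 errors, the values quoted for the calculations in FIGS. 5C-5D; the function names and the choice of a word with eight ‘1’ bits are illustrative assumptions.

```python
def n_addressable(n_bits):
    """Number of nonzero N-bit binary words, i.e. 2^N - 1."""
    return 2 ** n_bits - 1

def calling_rate(n_ones, n_bits, p10=0.10, p01=0.04):
    """Probability that a word with n_ones '1' bits is read with no bit errors,
    assuming independent errors at each bit."""
    return (1 - p10) ** n_ones * (1 - p01) ** (n_bits - n_ones)

print(n_addressable(16))  # 65535 words available with 16 rounds
for n in (4, 8, 16):
    # Example word with half of its bits set to '1'.
    print(n, round(calling_rate(n // 2, n), 3))
```

In a simple binary code nearly every misread word is itself a valid word for another species, so as the calling rate falls below one half with increasing N, most molecules are assigned to the wrong species.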

To address this challenge, error-robust encoding schemes were designed, in which only a subset of the 2^(N)−1 words separated by a certain Hamming distance were used to encode RNAs. In a codebook where the minimum Hamming distance is 4 (HD4 code), at least four bits must be read incorrectly to change one code word into another (FIG. 12A). As a result, every single-bit error produces a word that is uniquely close to a single code word, allowing such errors to be detected and corrected (FIG. 12B). Double-bit errors produce words with an equal Hamming distance of 2 from multiple code words and, thus, can be detected but not corrected (FIG. 12C). Such a code should substantially increase the calling rate and reduce the misidentification rate (FIGS. 5C and 5D, middle symbols). To further account for the fact that it is more likely to miss a hybridization event (a 1-->0 error) than to misidentify a background spot as an RNA (a 0-->1 error) in smFISH measurements, a modified HD4 (MHD4) code was designed, in which the number of ‘1’ bits was kept both constant and relatively low, only four per word, to reduce error and avoid biased detection. This MHD4 code should further increase the calling rate and reduce the misidentification rate (FIG. 5C, upper symbols; FIG. 5D, lower symbols).

In addition to the error considerations, several practical challenges have also made it difficult to probe a large number of RNA species, such as the high cost of the massive number of fluorescently labeled FISH probes needed and the long time required to complete many rounds of hybridization. To overcome these challenges, in this example, a two-step labeling scheme was designed to encode and read out cellular RNAs (FIG. 5E). First, cellular RNAs were labeled with a set of encoding probes (also called primary probes), each probe comprising an RNA targeting sequence and two flanking readout sequences. Four of the N unique readout sequences were assigned to each RNA species based on the MHD4 code word of the RNA. Second, these N readout sequences were identified with complementary FISH probes, the readout probes (also called secondary probes), via N rounds of hybridization and imaging, each round using a unique readout probe. To increase the signal-to-background ratio, every cellular RNA was labeled with ~192 encoding probes. Because each encoding probe contained two of the four readout sequences associated with that RNA (FIG. 5E), a maximum of ~96 readout probes can bind to each cellular RNA per hybridization round. To generate the massive number of encoding probes required, they were amplified from array-derived oligonucleotide pools containing tens of thousands of custom sequences using an enzymatic amplification process comprising in vitro transcription followed by reverse transcription (FIG. 13, see below regarding probe synthesis). This two-step labeling approach significantly diminished the total hybridization time for an experiment: it was found that efficient hybridization to the readout sequences took only 15 minutes whereas efficient direct hybridization to cellular RNA required more than 10 hours.

FIGS. 5A-5E describe MERFISH, a highly multiplexed smFISH approach using combinatorial labeling and error-robust encoding. FIG. 5A shows a schematic depiction of the identification of multiple RNA species in N rounds of imaging. Each RNA species is encoded with an N-bit binary word and during each round of imaging, only the subset of RNAs that should read ‘1’ in the corresponding bit emit signal. FIGS. 5B-5D show the number of addressable RNA species (FIG. 5B), the rate at which these RNAs are properly identified (calling rate) (FIG. 5C), and the rate at which RNAs are incorrectly identified as a different RNA species (misidentification rate) (FIG. 5D) plotted as a function of the number of bits (N) in the binary words encoding RNA. In FIGS. 5B and 5D, the upper dots are a simple binary code that includes all 2^(N)−1 possible binary words; the middle dots are the HD4 code where the Hamming distance separating words is 4; and the lower dots are the modified HD4 (MHD4) code where the number of ‘1’ bits is kept at four. These are reversed in FIG. 5C.

The calling and misidentification rates are calculated with per-bit error rates of 10% for the 1-->0 error and 4% for the 0-->1 error. FIG. 5E is a schematic diagram of the implementation of an MHD4 code for RNA identification. Each RNA species is first labeled with ~192 encoding probes that convert the RNA into a unique combination of readout sequences (Encoding hyb). These encoding probes each contain a central RNA targeting region flanked by two readout sequences, drawn from a pool of N different sequences, each associated with a specific hybridization round. Encoding probes for a specific RNA species contain a unique combination of four of the N readout sequences, which correspond to the four hybridization rounds where this RNA should read ‘1’. N subsequent rounds of hybridization with the fluorescent readout probes were used to probe the readout sequences (hyb 1, hyb 2, . . . , hyb N). The bound probes were inactivated by photobleaching between successive rounds of hybridization. For clarity, only one possible pairing of the readout sequences is depicted here for the encoding probes; however, all possible pairs of the four readout sequences are used at the same frequency and distributed randomly along each cellular RNA in the actual experiments.
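The relationship between the ~192 encoding probes per RNA and the maximum of ~96 readout probes per round follows from the pairing scheme described above. A small Python sketch, assuming exactly 192 encoding probes split evenly across all pairs of the four readout sequences; the bit labels 1, 2, 5, and 7 and the function name are illustrative only.

```python
from itertools import combinations

def readout_probe_budget(readout_bits, n_encoding_probes=192):
    """Split encoding probes evenly over all pairs of readout sequences and
    count how many probes carry each readout sequence."""
    pairs = list(combinations(readout_bits, 2))
    per_pair = n_encoding_probes / len(pairs)
    per_round = {bit: per_pair * sum(bit in pair for pair in pairs)
                 for bit in readout_bits}
    return pairs, per_round

pairs, per_round = readout_probe_budget([1, 2, 5, 7])
print(len(pairs))   # number of distinct pairs of four readout sequences
print(per_round)    # encoding probes carrying each readout sequence
```

With four readout sequences there are six possible pairs, each readout sequence appears in three of them, and so half of the encoding probes on a given RNA can bind the readout probe used in any one round.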

FIG. 11 shows a schematic description of a combinatorial labeling approach based on a simple binary code. In a conceptually simple labeling approach, 2^(N)−1 different RNA species can be uniquely encoded with all N-bit binary words (excluding the word with all ‘0’s). In each hybridization round, FISH probes that are targeted to all RNA species that have a ‘1’ in the corresponding bit are included. To increase the ability to discriminate RNA spots from background, each RNA is addressed with multiple FISH probes per hybridization round. Signal from the bound probes is extinguished before the next round of hybridization. This process continues for all N hybridization rounds (hyb 1, hyb 2, . . . ), and all 2^(N)−1 RNA species can be identified by the unique on-off pattern of fluorescence signals in each hybridization round.

FIGS. 12A-12C show schematic descriptions of Hamming distance and its use in the identification and correction of errors. FIG. 12A is a schematic representation of a Hamming distance of 4. FIGS. 12B and 12C are schematics showing the ability of an encoding scheme with Hamming distance 4 to correct single-bit errors (FIG. 12B) or detect but not correct double-bit errors (FIG. 12C). Arrows highlight bits at which the indicated words differ. Two code words are separated by a Hamming distance of 4 if one of the words has to flip four bits from ‘1’ to ‘0’ or ‘0’ to ‘1’ to convert into the other word. Single-bit error correction is possible because if a measured word differs from a legitimate code word by only one bit, it is most likely an error that arises from misreading this code word, since the code words of all the other RNA species will differ from the measured word by at least three bits. In this case, the measured word can be corrected to a code word that differs by only one bit. If a measured word differs from a legitimate code word by two bits, this measured word can still be identified as an error, but correction is no longer possible since more than one legitimate code word differs from this measured word by two bits.

FIG. 13 shows production of the library of encoding probes. An array-synthesized complex oligopool, containing ~100 k sequences, is used as a template for the enzymatic amplification of the encoding probes for different experiments. Each template sequence in the oligopool contains a central target region that can bind to a cellular RNA, two flanking readout sequences, and two flanking index primers. In the first step, the required template molecules for a specific experiment are selected and amplified with an indexed PCR reaction. To allow amplification via in vitro transcription, a T7 promoter is added to the PCR products during this step. In the second step, RNA is amplified from these template molecules via in vitro transcription. In the third step, this RNA is reverse transcribed back into DNA. In the final step, the template RNA is removed via alkaline hydrolysis, leaving only the desired ssDNA probes. This protocol produces ~2 nmol of complex pools of encoding probes containing ~20,000 different sequences for the 140-gene experiments or ~100,000 different sequences for the 1001-gene experiments.

Example 5

This example illustrates the measurement of 140 genes with MERFISH using a 16-bit MHD4 code. To test the feasibility of this error-robust, multiplexed imaging approach, this example uses a 140-gene measurement on human fibroblast cells (IMR90) using a 16-bit MHD4 code to encode 130 RNA species while leaving 10 code words as misidentification controls (FIGS. 20A-20H). After each round of hybridization with the fluorescent readout probes, cells were imaged by conventional wide-field imaging with an oblique-incidence illumination geometry. Fluorescent spots corresponding to individual RNAs were clearly detected and were then efficiently extinguished via a brief photobleaching step (FIG. 6A). The sample was stable throughout the 16 rounds of iterative labeling and imaging. The change in the number of fluorescent spots from round to round matched the predicted change based on the relative abundances of RNA species targeted in each round derived from bulk sequencing, and a systematic decreasing trend with increasing number of hybridization rounds was not observed (FIG. 14A). The average brightness of the spots varied from round to round with a standard deviation of 40%, likely due to different binding efficiencies of the readout probes to the different readout sequences on the encoding probes (FIG. 14B). Only a small, systematic decreasing trend in the spot brightness with increasing hybridization rounds was observed, which was on average 4% per round (FIG. 14B).

Next, binary words were constructed from the observed fluorescent spots based on their on-off patterns across the 16 hybridization rounds (FIGS. 6B-6D). If the word exactly matched one of the 140 MHD4 code words (exact matches) or differed by only one bit (error-correctable matches), it was assigned to the corresponding RNA species (FIG. 6D). Within the single cell depicted in FIGS. 6A and 6B, more than 1500 RNA molecules corresponding to 87% of the 130 encoded RNA species were detected after error correction (FIG. 6E). Similar observations were made in ~400 cells from 7 independent experiments. On average, ~4 times as many RNA molecules and ~2 times as many RNA species were detected per cell after error correction as compared with the values obtained before error correction (FIGS. 15A-15B).

Two types of errors can occur in the copy number measurement of each RNA species: 1) some molecules of this RNA species are not detected, leading to a drop in calling rate, and 2) some molecules from other RNA species are misidentified as this RNA species. To assess the extent of misidentification, the 10 misidentification control words were utilized, i.e., code words that were not associated with any cellular RNA. Although matches to these control words were observed, they occurred far less frequently than the real RNA-encoding words: 95% of the 130 RNA-encoding words were counted more frequently than the median count for these control words. Moreover, it was typically found that the ratio of the number of exact matches to the number of matches with one-bit errors for a real RNA-encoding word was substantially higher than the same ratios observed for the misidentification controls, as expected (FIGS. 16A and 16B). Using this ratio as a measure of the confidence in RNA identification, it was found that 91% of the 130 RNA species had a confidence ratio greater than the maximum confidence ratio observed for the misidentification controls (FIG. 6F), demonstrating a high accuracy of RNA identification. Subsequent analyses were conducted only on these 91% of genes.
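The confidence-ratio filter described above may be sketched as follows. The counts, gene names, and function names are hypothetical and serve only to illustrate the comparison against the maximum ratio observed for the misidentification controls; treating a word with no one-bit-error matches as having an infinite ratio is an assumption of this sketch.

```python
from typing import Dict, List, Tuple


def confidence_ratio(exact: int, one_bit: int) -> float:
    """Ratio of exact matches to matches with a one-bit error."""
    if one_bit == 0:
        # Assumption of this sketch: no one-bit-error matches means the
        # ratio is treated as unbounded.
        return float("inf")
    return exact / one_bit


def confident_species(
    counts: Dict[str, Tuple[int, int]], controls: List[str]
) -> List[str]:
    """Names of non-control words whose ratio exceeds every control ratio."""
    threshold = max(confidence_ratio(*counts[name]) for name in controls)
    return [
        name
        for name, (exact, one_bit) in counts.items()
        if name not in controls and confidence_ratio(exact, one_bit) > threshold
    ]


# Hypothetical (exact, one-bit-error) match counts per code word.
toy_counts = {
    "GENE_A": (900, 300),
    "GENE_B": (40, 80),
    "CTRL_1": (5, 10),
    "CTRL_2": (3, 4),
}
kept = confident_species(toy_counts, controls=["CTRL_1", "CTRL_2"])
```

Here the largest control ratio is 0.75, so only the hypothetical GENE_A (ratio 3.0) would be retained for further analysis.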

To estimate the calling rate, the error-correction ability of the MHD4 code was utilized to determine the 1-->0 error rates (10% on average) and 0-->1 error rates (4% on average) for each hybridization round (FIGS. 16C and 16D). Using these error rates, an ~80% calling rate for individual RNA species after error correction was estimated, i.e., ~80% of the fluorescent spots corresponding to an RNA species were decoded correctly (FIG. 16E). It is noted that although the remaining 20% of spots contributed to a loss in detection efficiency, most of them did not cause species misidentification because they were decoded as double-bit error words and discarded.
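As a rough consistency check on the figures above, the calling rate may be approximated from the average error rates. This sketch is not the per-round calculation used for FIG. 16E: it assumes that errors at different bits are independent, that every round has the average 1-->0 and 0-->1 rates, and that each 16-bit code word contains four ‘1’ bits and twelve ‘0’ bits. The function name is hypothetical.

```python
def calling_rates(p10: float, p01: float, n_ones: int = 4, n_zeros: int = 12):
    """Return (exact-match rate, rate after one-bit error correction).

    p10: probability that a '1' bit is read as '0'.
    p01: probability that a '0' bit is read as '1'.
    """
    # All bits read correctly.
    exact = (1 - p10) ** n_ones * (1 - p01) ** n_zeros
    # Exactly one '1' bit misread, every other bit correct.
    one_dropped = n_ones * p10 * (1 - p10) ** (n_ones - 1) * (1 - p01) ** n_zeros
    # Exactly one '0' bit misread, every other bit correct.
    one_added = n_zeros * p01 * (1 - p01) ** (n_zeros - 1) * (1 - p10) ** n_ones
    return exact, exact + one_dropped + one_added


# Average error rates reported in the text.
exact_rate, corrected_rate = calling_rates(p10=0.10, p01=0.04)
```

Under these simplifying assumptions the corrected rate comes out at roughly 0.78, in line with the ~80% calling rate estimated above, while the exact-match rate alone is roughly 0.40.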

To test for potential technical bias in these measurements, the same 130 RNA species were probed with a different MHD4 codebook by shuffling the code words among different RNA species (FIGS. 20A-20H) and changing the encoding probe sequences. Measurements with this alternative code gave similar misidentification and calling rates (FIGS. 17A-17D). The copy numbers of individual RNA species per cell measured with these two codebooks showed excellent agreement with a Pearson correlation coefficient of 0.94 (FIG. 6G), indicating that the choice of encoding scheme did not bias the measured counts.

In order to validate the copy numbers derived from the MERFISH experiments, conventional smFISH measurements were performed on 15 of the 130 genes, selected from the full measured abundance range of three orders of magnitude. For each of these genes, both the average copy number and the copy number distribution across many cells agreed quantitatively between the MERFISH and conventional smFISH measurements (FIGS. 18A and 18B). The ratio of the copy numbers determined by these two approaches was 0.82+/−0.06 (mean+/−SEM across the 15 measured RNA species, FIG. 18B), which agreed with the estimated 80% calling rate for the multiplexed imaging approach. The quantitative match between this ratio and the estimated calling rate over the full measured abundance range additionally supports the assessment that the misidentification error was low. Given that the agreement between the MERFISH and conventional smFISH results extended to the genes at the lowest measured abundance (<1 copy per cell, FIG. 18B), it was estimated that the measurement sensitivity was at least 1 copy per cell.

As a final validation, the abundance of each RNA species averaged over hundreds of cells was compared to that obtained from a bulk RNA sequencing measurement that was performed on the same cell line. The imaging results correlated remarkably well with bulk sequencing results with a Pearson correlation coefficient of 0.89 (FIG. 6H).

FIGS. 6A-6H show simultaneous measurement of 140 RNA species in single cells using MERFISH with a 16-bit MHD4 code. FIG. 6A shows images of RNA molecules in an IMR90 cell after each hybridization round (hyb 1-hyb 16). The image after photobleaching (bleach 1) demonstrated efficient removal of fluorescent signals between hybridizations. FIG. 6B shows the localizations of all detected single molecules in this cell colored based on their measured binary words. Inset: the composite fluorescent image of the 16 hybridization rounds for the boxed sub-region with numbered circles indicating potential RNA molecules. A circle indicates an unidentifiable molecule, the binary word of which does not match any of the 16-bit MHD4 code words even after error correction. FIG. 6C shows fluorescent images from each round of hybridization for the boxed sub-region in FIG. 6B with circles indicating potential RNA molecules. FIG. 6D shows corresponding words for the spots identified in FIG. 6C. Crosses represent the corrected bits. FIG. 6E shows the RNA copy number for each gene observed without (lower) or with (higher) error correction in this cell. FIG. 6F shows the confidence ratio measured for the 130 RNA species (left) and the 10 misidentification control words (right) normalized to the maximum value observed from the misidentification controls (dashed line). FIG. 6G is a scatter plot of the average copy number of each RNA species per cell measured with two shuffled codebooks of the MHD4 code. The Pearson correlation coefficient is 0.94 with a p-value of 1×10⁻³. The dashed line corresponds to the y=x line. FIG. 6H is a scatter plot of the average copy number of each RNA species per cell versus the abundance determined by bulk sequencing in fragments per kilobase per million reads (FPKM). The Pearson correlation coefficient between the logarithmic abundances of the two measurements was 0.89 with a p-value of 3×10⁻³⁹.

FIGS. 14A-14B show the number and average brightness of the fluorescent spots detected in the 16 rounds of hybridization before and after photobleaching. FIG. 14A shows the number of fluorescent spots observed per cell before (higher) and after (lower) photobleaching as a function of hybridization round averaged across all measurements with the first 16-bit MHD4 code. Photobleaching reduces the number of fluorescent spots by two or more orders of magnitude. Hybridization rounds without lower bars represent rounds in which no molecules were observed after bleaching. Also depicted is the expected change in the number of fluorescent spots from round to round (circles) predicted based on the relative abundances of the RNA species targeted in each hybridization round derived from bulk RNA sequencing. The average discrepancy between the observed and predicted number of spots for each hybridization is only 15% of the mean number of spots. This discrepancy does not systematically increase with the number of hybridization rounds. FIG. 14B shows the average brightness of the identified fluorescent spots in each hybridization round averaged across all measurements with the first 16-bit MHD4 code both before (upper) and after (lower) photobleaching. Brightness varies by 40% (standard deviation) across different hybridization rounds. The variation pattern is reproducible between experiments with the same code, likely due to differences in the binding efficiency of the readout probes to the different readout sequences. There is a small systematic trend of decrease in the brightness with increasing hybridization rounds, which is on average 4% per round. Photobleaching extinguishes fluorescence to a level similar to that of the autofluorescence of the cell.

FIGS. 15A-15B show that error correction substantially increases the numbers of RNA molecules and RNA species detected in individual cells. FIG. 15A shows a histogram of the ratio of the total number of molecules detected per cell with error correction to the number measured without error correction. FIG. 15B is a histogram of the ratio of the total number of RNA species detected in each cell with error correction to that without error correction. Both ratios are determined for ~200 cells and the histograms are constructed from these ratios.

FIGS. 16A-16E show characterization of the misidentification and calling rates of RNA species for the 140-gene experiments using a specific 16-bit MHD4 code. FIG. 16A shows the number of measured words exactly matching the code word corresponding to FLNA, represented by the bar in the center of the circle, and the number of measured words with one-bit error compared to the code word of FLNA, represented by the 16 bars on the circle. FIG. 16B is the same as FIG. 16A, but for a code word that was not assigned to any RNA, i.e., a misidentification control word. The solid lines connect the exact match to one-bit error words that are generated by 1-->0 errors. Based on the observation that the ratio of the number of exact matches to the number of error-correctable matches for a real RNA-encoding word was typically substantially higher than the same ratios observed for the misidentification controls, this ratio was defined as a confidence ratio for RNA identification. The confidence ratio measured for all 130 RNA species (center bar) and 10 misidentification control words not assigned to any RNA (outer bars) using this 16-bit MHD4 code is shown in FIG. 6F. FIGS. 16C and 16D show the average error rates for the 1-->0 error (FIG. 16C) and 0-->1 error (FIG. 16D) for each hybridization round. FIG. 16E shows the calling rate for each RNA species estimated from the 1-->0 and 0-->1 error rates. Genes are sorted from left to right based on the measured abundance, which spans three orders of magnitude. The calling rates are largely independent of the abundance of the gene.

FIGS. 17A-17D show characterization of the misidentification and calling rates for a second 16-bit MHD4 code. In this second encoding scheme, the 140 code words were shuffled among different RNA species and the encoding probe sequences were changed. FIG. 17A shows the normalized confidence ratio measured for the 130 RNA species (left) and the 10 misidentification control words not assigned to any RNA (right). The normalized confidence ratio is determined the same way as in FIG. 6F. FIGS. 17B and 17C show the average error rates determined for the 1-->0 error (FIG. 17B) and 0-->1 error (FIG. 17C) for each hybridization round. FIG. 17D shows the calling rate determined for each RNA species estimated from the 1-->0 and 0-->1 error rates. Genes are sorted from left to right based on the measured abundance.

FIGS. 18A-18C show a comparison of the MERFISH measurements with conventional smFISH results for a subset of genes. FIG. 18A shows the distributions of RNA copy numbers in single cells for three example genes KIAA1199, DYNC1H1, and LMTK2 in the high, medium, and low abundance ranges, respectively. Lighter bars: distributions constructed from ~400 cells in the 140-gene measurements using the MHD4 codes. Darker bars: distributions constructed from ~100 cells in the conventional smFISH measurements. FIG. 18B shows a comparison of the average RNA copy numbers per cell measured in the 140-gene experiments using the MHD4 codes to those determined by conventional smFISH for 15 genes. The average ratio of the copy number measured using the MHD4 measurements to that measured using conventional smFISH was 0.82+/−0.06 (mean+/−SEM across 15 genes). The dashed line corresponds to the y=x line.

FIGS. 20A-20H show two different codebooks for the 140-gene experiments. Shown are the specific code words of the 16-bit MHD4 code assigned to each RNA species studied in the two shuffles of the 140-gene experiment. The “Genes” columns contain the name of the gene. The “Codewords” columns contain the specific binary word assigned to each gene.

Example 6

This example is generally directed to high-throughput analysis of cell-to-cell variation in gene expression. The MERFISH approach allows parallelization of measurements of many individual RNA species and co-variation analysis between different RNA species. In this example, the parallelization aspect was first illustrated by examining the cell-to-cell variation in the expression level of each of the measured genes (FIG. 7A). To quantify the measured variation, the Fano factors, defined as the ratio of the variance to the mean RNA copy number, were computed for all measured RNA species. The Fano factors substantially deviated from 1, the value expected for a simple Poisson process, for many genes and exhibited an increasing trend with the mean RNA abundance (FIG. 7B). This trend of increasing Fano factors with mean RNA abundance can be explained by changes in the transcription rate and/or promoter off-switching rates but not by changes in the promoter on-switching rate.
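The Fano factor defined above may be computed as in the following minimal sketch. The function name and the copy numbers are hypothetical, and the use of the population variance (rather than the sample variance) is an assumption of this sketch, since the text does not specify which estimator was used.

```python
from statistics import mean, pvariance


def fano_factor(copy_numbers):
    """Variance of the per-cell RNA copy number divided by its mean."""
    m = mean(copy_numbers)
    if m == 0:
        raise ValueError("Fano factor is undefined for a mean of zero")
    # Assumption: population variance; statistics.variance would give
    # the sample variance instead.
    return pvariance(copy_numbers) / m


# Hypothetical copy numbers of one RNA species in eight cells.
toy_copies = [2, 4, 4, 4, 5, 5, 7, 9]
toy_fano = fano_factor(toy_copies)
```

For these toy values the mean is 5 and the population variance is 4, giving a Fano factor of 0.8; a value of 1 would be expected for a simple Poisson process.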

Moreover, several RNA species were identified with substantially larger Fano factors than this average trend. For example, it was found that SLC5A3, CENPF, MKI67, TNC and KIAA1199 displayed Fano factor values substantially higher than those of the other genes expressed at similar abundance levels. The high variability of some of these genes can be explained by their association with the cell cycle. For example, two of these particularly ‘noisy’ genes, MKI67 and CENPF, were both annotated as cell-cycle related genes, and based on their bimodal expression (FIG. 7C), it is proposed that their transcription is strongly regulated by the cell cycle. Other high-variability genes did not show the same bimodal expression patterns and are not known to be associated with the cell cycle.

Analysis of co-variations in the expression levels of different genes can reveal which genes are co-regulated and elucidate gene regulatory pathways. At the population level, such analysis often requires the application of external stimuli to drive gene expression variation; hence, correlated expression changes can be observed among genes that share common regulatory elements influenced by the stimuli. At the single-cell level, one can take advantage of the natural stochastic fluctuations in gene expression for such analysis and can thus study multiple regulatory networks without having to stimulate each of them individually. Such co-variation analysis can constrain regulatory networks, suggest new regulatory pathways, and predict function for unannotated genes based on associations with co-varying genes.

This approach was applied to the 140-gene measurements and the ~10,000 pairwise correlation coefficients describing how the expression levels of each pair of genes co-varied from cell to cell were examined. Many of the highly variable genes showed tightly correlated or anti-correlated variations (FIG. 7C). To better understand the correlations for all gene pairs, a hierarchical clustering approach was adopted to organize these genes based on their correlation coefficients (FIG. 7D). From the cluster tree structure, seven groups of genes with substantially correlated expression patterns were identified (FIG. 7D). Within each of the seven groups, every gene showed significantly stronger average correlation with other members of the group than with genes outside the group. To further validate and understand these groups, gene ontology (GO) terms enriched in each of these seven groups were identified. Notably, the enriched GO terms within each group shared similar functions and were largely unique to each group (FIGS. 7E and 7F), validating the notion that the observed co-variation in expression reflects some commonalities in the regulation of these genes.
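The pairwise co-variation step may be sketched as follows. The gene names, copy numbers, and function names are hypothetical; the sketch computes only the Pearson correlation coefficients for every gene pair and leaves out the hierarchical clustering step, which would typically be delegated to a numerical library.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def pairwise_correlations(copies_per_cell):
    """Map each unordered gene pair to the correlation of its per-cell counts."""
    return {
        (g1, g2): pearson(copies_per_cell[g1], copies_per_cell[g2])
        for g1, g2 in combinations(sorted(copies_per_cell), 2)
    }


# Hypothetical copy numbers for three genes measured in the same four cells.
toy_counts = {
    "GENE_A": [10, 20, 30, 40],
    "GENE_B": [1, 2, 3, 4],  # rises with GENE_A
    "GENE_C": [8, 6, 4, 2],  # falls as GENE_A rises
}
correlations = pairwise_correlations(toy_counts)

# Number of distinct pairs that can be formed from 140 code words.
n_pairs_140 = 140 * 139 // 2
```

The pair count for 140 words is 9,730, which is consistent with the ~10,000 pairwise coefficients mentioned above.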

This example describes two of these groups as illustrative examples. The predominant GO terms associated with Group 1 were terms associated with the extracellular matrix (ECM) (FIGS. 7D to 7F). Notable members of this group included ECM components, such as FBN1, FBN2, COL5A, COL7A and TNC, and glycoproteins linking the ECM and cell membranes, such as VCAN and THBS1. The group also included an unannotated gene, KIAA1199, which may be predicted to play a role in ECM metabolism based on its association with this cluster. Indeed, this gene has recently been identified as an enzyme involved in the regulation of hyaluronan, a major sugar component of the ECM.

Group 6 contained many genes that encode vesicle transport proteins and proteins associated with cell motility (FIGS. 7D to 7F). The vesicle transport genes included microtubule motors and related genes DYNC1H1, CKAP1, and factors associated with vesicle formation and trafficking, like DNAJC13 and RAB3B. Again, an unannotated gene, KIAA1462, was found within this cluster. Based on its strong correlation with DYNC1H1 and DNAJC13, it is predicted that this gene may be involved in vesicle transport. The cell motility genes in this group included actin-binding proteins like AFAP1, SPTAN1, SPTBN1, and MYH10, and genes involved in the formation of adhesion complexes, like FLNA and FLNC. Several GTPase-associated factors involved in the regulation of cell motility, attachment and contraction also fell into this group, including DOCK7, ROCK2, IQGAP1, PRKCA, and AMOTL1. The observation that some cell motility genes correlated with vesicle transport genes is consistent with the role of vesicle transport in cell migration. An additional interesting feature of group 6 is that a subset of these genes, in particular those related to cell motility, were anti-correlated with members of the ECM group discussed above (FIG. 7D). This anti-correlation may reflect regulatory interactions that mediate switching of cells between adherent and migratory states.

FIGS. 7A-7F show cell-to-cell variations and pairwise correlations for the RNA species determined from the 140-gene measurements. FIG. 7A shows a comparison of gene expression levels in two individual cells. FIG. 7B shows Fano factors for individual genes. Error bars represent standard error of the mean determined from 7 independent data sets. FIG. 7C shows Z-scores of the expression variations of four example pairs of genes showing correlated (top two) or anti-correlated (bottom two) variation for 100 randomly selected cells. The Z-score is defined as the difference from the mean normalized by the standard deviation. FIG. 7D is a matrix of the pairwise correlation coefficients of the cell-to-cell variation in expression for the measured genes, shown together with the hierarchical clustering tree. The seven groups identified by a specific threshold on the cluster tree (dashed line) are indicated by the black boxes in the matrix and lines on the tree, with grey lines on the tree indicating ungrouped genes. Different threshold choices on the cluster tree could be made to select either smaller subgroups with tighter correlations or larger super-groups containing more weakly coupled subgroups. Two of the seven groups are enlarged on the right. FIGS. 7E-7F show enrichment of 30 selected, statistically significantly enriched GO terms in the seven groups. Enrichment refers to the ratio of the fraction of genes within a group that have the specific GO term to the fraction of all measured genes having that term. Not all of the GO terms presented here are in the top 10 list.
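The enrichment measure defined for FIGS. 7E-7F may be written out as in the following sketch. The function name and gene labels are hypothetical, and the sketch computes only the ratio of fractions; it does not assess statistical significance.

```python
def go_enrichment(group_genes, all_genes, genes_with_term):
    """Fraction of group genes carrying a GO term divided by the fraction
    of all measured genes carrying that term."""
    group = set(group_genes)
    measured = set(all_genes)
    annotated = set(genes_with_term)
    frac_all = len(measured & annotated) / len(measured)
    if frac_all == 0:
        raise ValueError("no measured gene carries this GO term")
    frac_group = len(group & annotated) / len(group)
    return frac_group / frac_all


# Hypothetical labels: 20 measured genes, 5 of which carry the term,
# and a group of 4 genes, 2 of which carry the term.
toy_all = [f"G{i}" for i in range(20)]
toy_term = ["G0", "G1", "G10", "G11", "G12"]
toy_group = ["G0", "G1", "G2", "G3"]
toy_enrichment = go_enrichment(toy_group, toy_all, toy_term)
```

In this toy case half of the group carries the term against a quarter of all measured genes, so the enrichment is 2.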

Example 7

This example illustrates mapping spatial distributions of RNAs. As an imaging-based approach, MERFISH also allowed the investigation of the spatial distributions of many RNA species simultaneously. Several patterns emerged from the visual inspection of individual genes, with some RNA transcripts enriched in the perinuclear region, some enriched in the cell periphery, and some scattered throughout the cell (FIG. 8A). To identify genes with similar spatial distributions, the correlation coefficients for the spatial density profiles of all pairs of RNA species were determined, and these RNAs were organized based on the pairwise correlations again using the hierarchical clustering approach. The correlation coefficient matrix showed groups of genes with correlated spatial organizations, and the two most notable groups with the strongest correlations are indicated in FIG. 8B. Group I RNAs appeared enriched in the perinuclear region whereas group II RNAs appeared enriched near the cell peripheral region (FIG. 8C). Quantitative analysis of the distances between each RNA molecule and the cell nucleus or the cell periphery indeed confirmed this visual impression (FIG. 8D).

Group I contained genes encoding extracellular proteins such as FBN1, FBN2 and THBS1, secreted proteins such as PAPPA, and integral membrane proteins such as LRP1 and GPR107. These proteins have no obvious commonalities in function. Rather, a GO analysis showed significant enrichment for location terms, such as extracellular region, basement membrane, or perivitelline space (FIG. 8E). To reach these locations, proteins must pass through the secretion pathway, which often requires translation of mRNA at the endoplasmic reticulum (ER). Thus, it is proposed that the spatial pattern that was observed for these mRNAs reflects their co-translational enrichment at the ER. The enrichment of these mRNAs in the perinuclear region (FIGS. 8C and 8D, lighter shading), where the rough ER resides, supports this conclusion.

Group II contained genes encoding the actin-binding proteins, including filamins FLNA and FLNC, talin TLN1, and spectrins SPTAN1 and SPTBN1; the microtubule-binding protein CKAP5; and the motor proteins MYH10 and DYNC1H1. This group was enriched with GO terms such as cortical actin cytoskeleton, actin filament binding, and cell-cell adherens junction (FIG. 8E). Beta-actin mRNA may be enriched near the cell periphery in fibroblasts as are mRNAs that encode members of the actin-binding Arp2/3 complex. The enrichment of group II mRNAs in the peripheral region of the cells (FIGS. 8C and 8D) suggests that the spatial distribution of the Group II genes might be related to the distribution of actin cytoskeleton mRNAs.

FIGS. 8A-8E show distinct spatial distributions of RNAs observed in the 140-gene measurements. FIG. 8A shows examples of the spatial distributions observed for four different RNA species in a cell. FIG. 8B is a matrix of the pairwise correlation coefficients describing the degree to which the spatial distributions of each gene pair are correlated, shown together with the hierarchical clustering tree. Two strongly correlating groups are indicated by the black boxes on the matrix and shading on the tree. FIG. 8C shows the spatial distributions of all RNAs in the two groups in two example cells. Lighter symbols: group I genes; darker symbols: group II genes. FIG. 8D shows average distances for genes in group I and genes in group II to the cell edge or the nucleus normalized to the average distances for all genes. Error bars represent SEM across 7 data sets. FIG. 8E shows enrichment of GO terms in each of the two groups.

Example 8

This example illustrates measuring 1001 genes with a 14-bit MHD2 code. This example further increases the throughput of MERFISH measurements by simultaneously imaging ~1000 RNA species. This increase could be achieved with the MHD4 code by increasing the number of bits per code word to 32 while maintaining the number of ‘1’ bits per word at four (FIG. 5B). While the stability of the samples across many hybridization rounds (FIGS. 14A-14B) suggests that such an extension is potentially feasible, an alternative approach is shown here that did not require an increase in the number of hybridizations by relaxing the error correction requirement but keeping the error detection capability. For example, by reducing the Hamming distance from 4 to 2, all 14-bit words that contain four ‘1’ bits could be used to encode 1001 genes, and these RNAs were probed with only 14 rounds of hybridization. However, because a single error can produce a word equally close to two different code words, error correction is no longer possible for this modified Hamming-distance-2 (MHD2) code. Hence, it was expected that the calling rate would be lower and the misidentification rate would be higher with this encoding scheme.
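The counting argument above can be checked with a short enumeration. This is a sketch only: the function name is hypothetical, and words are represented as integers whose binary digits stand for the 14 hybridization rounds.

```python
from itertools import combinations
from math import comb


def constant_weight_words(n_bits, n_ones):
    """All n_bits-long binary words with exactly n_ones '1' bits, as integers."""
    words = []
    for positions in combinations(range(n_bits), n_ones):
        word = 0
        for pos in positions:
            word |= 1 << pos
        words.append(word)
    return words


words = constant_weight_words(14, 4)
word_set = set(words)

# Smallest number of differing bits between any two distinct code words.
min_distance = min(bin(a ^ b).count("1") for a, b in combinations(words, 2))

# A single flipped bit changes the number of '1' bits to three or five,
# so the result is never another code word: the error is detectable.
single_errors_detected = all(
    (w ^ (1 << bit)) not in word_set for w in words for bit in range(14)
)
```

The enumeration yields 1001 words, matching the binomial coefficient comb(14, 4), with a minimum pairwise Hamming distance of 2. Every single-bit error is detected, but because a corrupted word can lie one bit away from more than one code word, it cannot be corrected.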

To evaluate the performance of this 14-bit MHD2 code, 16 of the 1001 possible code words were set aside as misidentification controls and the remaining 985 words were used to encode cellular RNAs. These 985 RNAs included 107 RNA species probed in the 140-gene experiments as an additional control. The 1001-gene experiments were performed in IMR90 cells using a similar procedure as described above. To allow all encoding probes to be synthesized from a single 100,000-member oligopool, the number of encoding probes per RNA species was reduced to ~94. Fluorescent spots corresponding to individual RNA molecules were again detected in each round of hybridization with the readout probes and, based on their on-off patterns, these spots were decoded into RNA (FIGS. 9A, 19A and 19B). 430 RNA species were detected in the cell shown in FIG. 9A, and similar results were obtained in ~200 imaged cells in 3 independent experiments.

As expected, the misidentification rate of this scheme was higher than that of the MHD4 code. 77% of all real RNA words were detected more frequently than the median count for the misidentification controls instead of the 95% value observed in the MHD4 measurements. Using the same confidence ratio analysis as described above, it was found that 73% (instead of 91% for the MHD4 measurements) of the 985 RNA species were measured with a confidence ratio larger than the maximum value observed for the misidentification controls (FIG. 19C). RNA copy numbers measured from these 73% RNA species showed excellent correlation with the bulk RNA sequencing results (Pearson correlation coefficient r=0.76; FIG. 9B, black). It is worth noting that the remaining 27% of the genes still exhibit good, albeit lower, correlation with the bulk RNA sequencing data (r=0.65; FIG. 9B, red), but the conservative measure of excluding them from further analysis was taken.

The lack of an error correction capability also decreased the calling rate of each RNA species: when comparing the 107 RNA species common to both the 1001-gene and 140-gene measurements, it was found that the copy numbers per cell of these RNA species were lower in the 1001-gene measurements (FIGS. 9C and 19D). The total count of these RNAs per cell was ~⅓ of that observed in the 140-gene measurements. Thus, the lack of error correction in the MHD2 code produced a ~3-fold decrease in the calling rate, which is consistent with the ~4-fold decrease in calling rate observed for the MHD4 code when error correction was not applied. As expected from the quantitative agreement between 140-gene measurements and conventional smFISH results, comparison of the 1001-gene measurements with conventional smFISH results for 10 RNA species also indicated a ~3-fold drop in calling rate (FIG. 18C). Despite the expected reduction in calling rate, the good correlations found between the copy numbers observed in the 1001-gene measurements and those observed in the 140-gene measurements, as well as in conventional smFISH and bulk RNA sequencing measurements, indicate that the relative abundance of these RNAs can be quantified with the MHD2 encoding scheme.

Simultaneously imaging ~1000 genes in individual cells substantially expanded the ability to detect co-regulated genes. FIG. 10A shows the matrix of pairwise correlation coefficients determined from the cell-to-cell variations in the expression levels of these genes. Using the same hierarchical clustering analysis as described above, ~100 groups of genes with correlated expression were identified. Remarkably, nearly all of these ~100 groups showed statistically significant enrichment of functionally related GO terms (FIGS. 10B-10C). These included some of the groups identified in the 140-gene measurements, such as the group associated with cell replication genes and the group associated with cell motility genes (FIGS. 10A-10C, groups 7 and 102), as well as many new groups. The groups identified here included 46 RNA species lacking any previous GO annotations, for which function based on their group association may be hypothesized. For example, KIAA1462 is part of the cell motility group, as also shown in the 140-gene experiments, suggesting a potential role of this gene in cell motility (FIG. 10A, group 102). Likewise, KIAA0355 is part of a new group enriched in genes associated with heart development (FIG. 10A, group 79), and C17orf70 is part of a group associated with ribosomal RNA processing (FIG. 10A, group 22). Using these groupings, cellular functions for 61 transcription factors and other partially annotated proteins of unknown functions may be hypothesized. For example, the transcription factors Z3CH13 and CHD8 are both members of the cell motility group, suggesting their potential role in the transcriptional regulation of cell motility genes.

FIGS. 9A-9C show simultaneous measurements of 1001 genes in single cells using MERFISH with a 14-bit MHD2 code. FIG. 9A shows the localizations of all detected single molecules in a cell colored based on their measured binary words. Inset: the composite, false-colored fluorescent image of the 14 hybridization rounds for the boxed sub-region with numbered circles indicating potential RNA molecules. Circles indicate unidentifiable molecules, the binary words of which do not match any of the 14-bit MHD2 code words. Images of individual hybridization rounds are shown in FIG. 19A. FIG. 9B is a scatter plot of the average copy number per cell measured in the 1001-gene experiments versus the abundance measured via bulk sequencing. The upper symbols are for the 73% of genes detected with confidence ratios higher than the maximum ratio observed for the misidentification controls. The Pearson correlation coefficient is 0.76 with a p-value of 3×10⁻¹³³. The lower symbols are for the remaining 27% of genes. The Pearson correlation coefficient is 0.65 with a p-value of 3×10⁻³³. FIG. 9C is a scatter plot of the average copy number for the 107 genes shared in both the 1001-gene measurement with the MHD2 code and the 140-gene measurement with the MHD4 code. The Pearson correlation coefficient is 0.89 with a p-value of 9×10⁻³⁰. The dashed line corresponds to the y=x line.

FIGS. 10A-10C show co-variation analysis of the RNA species measured in the 1001-gene measurements. FIG. 10A is a matrix of all pairwise correlation coefficients of the cell-to-cell variation in expression for the measured genes, shown with the hierarchical clustering tree. The ~100 identified groups of correlated genes are indicated by shading on the tree. Zoomed-in views of four of the groups described in the text are shown on the right. FIG. 10B-FIG. 10C show the enrichment of 20 selected, statistically significantly enriched GO terms in the four groups.

FIGS. 19A-19D show decoding and error assessment of the 1001-gene experiments. FIG. 19A shows images of the boxed sub-region of the cell in FIG. 9A for each of 14 hybridization rounds. The final panel is a composite image of these 14 rounds. Circles indicate fluorescent spots that have been identified as potential RNA molecules. Some circles in the composite image indicate unidentifiable molecules, the binary words of which do not match any of the 14-bit MHD2 code words. FIG. 19B shows the corresponding binary word for each of the spots identified in FIG. 19A with the RNA species to which it is decoded. "Unidentified" indicates that the measured binary word does not match any of the 1001 code words. FIG. 19C shows the normalized confidence ratios measured for the 985 RNA species (left) and the 16 misidentification control words not targeted to any RNA (right). The normalized confidence ratio is defined as in FIG. 6F. FIG. 19D shows a histogram of the reduction in detected abundance of 107 genes present in both the 1001-gene experiments and the 140-gene experiments. "Fold decrease in copy number" is defined as the average number of RNA molecules per cell for each species measured in the 140-gene experiments divided by the corresponding average number measured in the 1001-gene experiments.

FIG. 18C is a comparison of the average RNA copy numbers per cell measured in the 1001-gene experiments using the MHD2 code to those determined by conventional smFISH for 10 genes. The average ratio of the copy number measured using the MHD2 measurements to that measured using conventional smFISH was 0.30+/−0.05 (mean+/−SEM across 10 genes). The dashed line corresponds to the y=x line and the dotted line corresponds to the y=0.30x line.

Example 9

The above examples illustrate a highly multiplexed detection scheme for systems-level RNA imaging in single cells. Using combinatorial labeling, sequential hybridization and imaging, and two different error-robust encoding schemes, either 140 or 1001 genes in hundreds of individual human fibroblast cells were simultaneously imaged. Of the two encoding schemes presented here, the MHD4 code is capable of both error detection and error correction, and hence can provide a higher calling rate and a lower misidentification rate than the MHD2 code, which can detect but not correct errors. MHD2, on the other hand, provides a faster scaling of the degree of multiplexing with the number of bits than MHD4. Other error-robust encoding schemes can also be used for such multiplexed imaging, and experimenters can set the balance between detection accuracy and ease of multiplexing based on the specific requirements of the experiments.

By increasing the number of bits in the code words, it should be possible to further increase the number of detectable RNA species using MERFISH with, for example, an MHD4 or MHD2 code. For example, using the MHD4 code with 32 total bits and four or six ‘1’ bits would increase the number of addressable RNA species to 1,240 or 27,776, respectively. The latter is the approximate scale of the human transcriptome. The predicted misidentification and calling rates are still reasonable for the 32-bit MHD4 code (shown in FIGS. 5C and 5D for the MHD4 code with four ‘1’ bits; similar rates were calculated for the MHD4 code with six ‘1’ bits). If more accurate measurements are desired, an additional increase in the number of bits would allow the use of encoding schemes with a Hamming distance greater than 4, further enhancing the error detection and correction capability. While an increase in the number of bits by adding more hybridization rounds would increase the data collection time and potentially lead to sample degradation, these problems could be mitigated by utilizing multiple colors to read out multiple bits in each round of hybridization.
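The code-word counts quoted above can be checked numerically. The following is a minimal sketch, assuming that the MHD4 code words are the constant-weight words of an extended Hamming code (consistent with the generator-matrix description in the materials and methods) and that the 14-bit MHD2 code consists of all 14-bit words with four ‘1’ bits. The function name is illustrative only. The count uses the standard weight enumerator of the extended Hamming code, obtained from the MacWilliams identity.

```python
from math import comb

def extended_hamming_words(n_bits: int, n_ones: int) -> int:
    """Count code words with n_ones '1' bits in the extended Hamming code
    of length n_bits (n_bits must be a power of two, e.g. 16 or 32)."""
    if n_bits < 4 or n_bits & (n_bits - 1):
        raise ValueError("n_bits must be a power of two >= 4")
    if n_ones % 2:
        return 0  # the extended code contains only even-weight words
    half = n_ones // 2
    # Coefficient of z**n_ones in (1+z)**n + (1-z)**n + (2n-2)*(1-z**2)**(n/2)
    coeff = (2 * comb(n_bits, n_ones)
             + (2 * n_bits - 2) * (-1) ** half * comb(n_bits // 2, half))
    return coeff // (2 * n_bits)

print(extended_hamming_words(16, 4))  # 16-bit MHD4 code, four '1' bits
print(extended_hamming_words(32, 4))  # 32-bit MHD4 code, four '1' bits
print(extended_hamming_words(32, 6))  # 32-bit MHD4 code, six '1' bits
print(comb(14, 4))                    # 14-bit MHD2 code, four '1' bits
```

Under these assumptions the four calls give 140, 1,240, 27,776, and 1,001, which agree with the numbers of code words stated in the text.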

As the degree of multiplexing is increased, it is important to consider the potential increase in the density of RNAs that need to be resolved in each round of imaging. Based on the imaging and sequencing results, it can be estimated that including the whole transcriptome of the IMR90 cells would lead to a total RNA density of ~200 molecules/micrometer³. Using the current imaging and analysis methods, 2-3 molecules/micrometer³ per hybridization round could be resolved, which would reach a total RNA density of ~20 molecules/micrometer³ after 32 rounds of hybridization. This density should allow all but the top 10% most expressed genes to be imaged simultaneously or a subset of genes with even higher expression levels to be included. By utilizing more advanced image analysis algorithms to better resolve overlapping images of individual molecules, such as compressed sensing, it is possible to extend the resolvable density by ~4-fold and thus allow all but the top 2% most expressed genes to be imaged all together.
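As a rough check of these figures, assume code words with m = 4 ‘1’ bits out of N = 32 bits, so that each RNA is fluorescent in only 4 of the 32 rounds, and take a resolvable per-round density of about 2.5 molecules/micrometer³ (the midpoint of the quoted range, chosen here only for illustration). Then:

$\rho_{\text{total}} \approx \rho_{\text{round}} \times \frac{N}{m} = 2.5 \times \frac{32}{4} = 20\ \text{molecules/micrometer}^{3}$

Under the same assumptions, a ~4-fold gain in resolvable density would raise this to roughly 80 molecules/micrometer³.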

These examples have illustrated the utility of the data derived from highly multiplexed RNA imaging by using co-variation and correlation analysis to reveal distinct sub-cellular distribution patterns of RNAs, to constrain gene regulatory networks, and to predict functions for many previously unannotated or partially annotated genes with unknown functions. Given its ability to quantify RNAs across a wide range of abundances without amplification bias while preserving native context, systems and methods such as MERFISH will allow many applications of in situ transcriptomic analyses of individual cells in culture or complex tissues.

Example 10

Following are various materials and methods used in the above examples.

Probe design. Each RNA species in the target set was randomly assigned a binary code word either from all 140 possible code words of the 16-bit MHD4 code or from all 1001 possible code words of the 14-bit MHD2 code.
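The random assignment can be sketched as follows for the MHD2 case. This is an illustration only: it assumes that the 14-bit MHD2 codebook is the set of all 14-bit words with exactly four ‘1’ bits (which yields 1001 words), and the function names, gene names, and fixed seed are hypothetical.

```python
import random
from itertools import combinations

def constant_weight_words(n_bits, n_ones):
    """All n_bits-long binary words (as strings) with exactly n_ones '1' bits."""
    words = []
    for ones in combinations(range(n_bits), n_ones):
        bits = ["0"] * n_bits
        for i in ones:
            bits[i] = "1"
        words.append("".join(bits))
    return words

def assign_code_words(targets, codebook, seed=0):
    """Randomly give each target a distinct code word from the codebook."""
    if len(targets) > len(codebook):
        raise ValueError("more targets than code words")
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    chosen = rng.sample(codebook, len(targets))
    return dict(zip(targets, chosen))

mhd2_words = constant_weight_words(14, 4)
genes = [f"gene_{i}" for i in range(985)]  # hypothetical placeholder names
assignment = assign_code_words(genes, mhd2_words)
print(len(mhd2_words), len(assignment))
```

Any two distinct words of equal weight differ in at least two bits, which is why such a codebook has a minimum Hamming distance of 2. Words left unassigned in this sketch could serve as misidentification controls.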

Array-synthesized oligopools were used as templates to make the encoding probes. The template molecule for each encoding probe contained three components: i) a central targeting sequence for in situ hybridization to the target RNA, ii) two flanking readout sequences designed to hybridize to each of two distinct readout probes, and iii) two flanking primer sequences to allow enzymatic amplification of the probes (FIG. 13). The readout sequences were taken from the 16 possible readout sequences, each corresponding to one hybridization round. The readout sequences were assigned to the encoding probes such that, for any RNA species, each of the 4 readout sequences was distributed uniformly along the length of the target RNA and appeared at the same frequency. Template molecules for the 140-gene library also included a common 20-nucleotide (nt) priming region between the first PCR primer and the first readout sequence. This priming sequence was used for the reverse transcription step described below.

Multiple experiments were embedded in a single array-synthesized oligopool, and PCR was used to selectively amplify only the oligos required for a specific experiment. Primer sequences for this indexed PCR reaction were generated from a set of orthogonal 25-nt sequences. These sequences were trimmed to 20 nt and selected for i) a narrow melting temperature range (70° C. to 80° C.), ii) the absence of consecutive repeats of 3 or more identical nucleotides, and iii) the presence of a GC clamp, i.e., one of the two 3′ terminal bases must be G or C. To further improve specificity, these sequences were then screened against the human transcriptome using BLAST+, and primers with 14 or more contiguous bases of homology were eliminated. Finally, BLAST+ was again used to identify and exclude primers that had an 11-nt homology region at the 3′ end of any other primer or a 5-nt homology region at the 3′ end of the T7 promoter. The forward primer sequences (Primer 1) were determined as described above, whereas the reverse primers each contain a 20-nt sequence as described above plus a 20-nt T7 promoter sequence to facilitate amplification via in vitro transcription (Primer 2). The primer sequences used in the 140-gene and 1001-gene experiments are listed below.
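The two purely sequence-based selection criteria (no runs of identical nucleotides and a GC clamp) can be sketched as simple string checks. This is a minimal illustration, not the screening code actually used: the function names and test sequences are hypothetical, and the melting-temperature and BLAST+ homology filters are omitted because they depend on a thermodynamic model and databases not specified here.

```python
def has_homopolymer_run(seq, run_length=3):
    """True if seq contains run_length or more identical consecutive nucleotides."""
    run = 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        if run >= run_length:
            return True
    return False

def has_gc_clamp(seq):
    """True if at least one of the two 3' terminal bases is G or C."""
    return any(base in "GC" for base in seq[-2:])

def passes_sequence_filters(seq, length=20):
    """Apply the length, repeat, and GC-clamp criteria to one candidate primer."""
    seq = seq.upper()
    return (len(seq) == length
            and not has_homopolymer_run(seq)
            and has_gc_clamp(seq))

# Hypothetical candidates (5' to 3'), not sequences from the actual primer set
candidates = [
    "ACGTACGTACGTACGTACGC",  # no repeats, ends in GC
    "ACGTACGTAAAGTACGTACG",  # contains an AAA run
    "ACGTACGTACGTACGTACAT",  # last two bases are A and T
]
for seq in candidates:
    print(seq, passes_sequence_filters(seq))
```

Only the first hypothetical candidate satisfies both criteria; the second fails the repeat check and the third lacks a GC clamp.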

TABLE 2

Experiment Name | Primer 1 Sequence (Index Primer 1) | Primer 2 Sequence (T7 promoter plus the reverse complement of Index Primer 2)
140-gene Codebook 1 | GTTGGTCGGCACTTGGGTGC (SEQ ID NO: 18) | TAATACGACTCACTATAGGGAAAGCCGGTTCATCCGGTGG (SEQ ID NO: 21)
140-gene Codebook 2 | CGATGCGCCAATTCCGGTTC (SEQ ID NO: 19) | TAATACGACTCACTATAGGGTGATCATCGCTCGCGGGTTG (SEQ ID NO: 22)
1001-gene | CGCGGGCTATATGCGAACCG (SEQ ID NO: 20) | TAATACGACTCACTATAGGGCGTGGAGGGCATACAACGC (SEQ ID NO: 23)

30-nt-long readout sequences were created by concatenating fragments of the same orthogonal primer set generated above, combining one 20-nt primer with a 10-nt fragment of another. These readout sequences were then screened, using BLAST+, for orthogonality with the index primer sequences and other readout sequences (no more than 11 nt of homology) and for potential off-target binding sites in the human genome (no more than 14 nt of homology). Fluorescently labeled readout probes with sequences complementary to the readout sequences were used to probe these readout sequences, one in each hybridization round. All readout probe sequences used are listed below.

TABLE 3

Bit | Readout probe | SEQ ID NO
1 | CGCAACGCTTGGGACGGTTCCAATCGGATC/3Cy5Sp/ | SEQ ID NO: 24
2 | CGAATGCTCTGGCCTCGAACGAACGATAGC/3Cy5Sp/ | SEQ ID NO: 25
3 | ACAAATCCGACCAGATCGGACGATCATGGG/3Cy5Sp/ | SEQ ID NO: 26
4 | CAAGTATGCAGCGCGATTGACCGTCTCGTT/3Cy5Sp/ | SEQ ID NO: 27
5 | GCGGGAAGCACGTGGATTAGGGCATCGACC/3Cy5Sp/ | SEQ ID NO: 28
6 | AAGTCGTACGCCGATGCGCAGCAATTCACT/3Cy5Sp/ | SEQ ID NO: 29
7 | CGAAACATCGGCCACGGTCCCGTTGAACTT/3Cy5Sp/ | SEQ ID NO: 30
8 | ACGAATCCACCGTCCAGCGCGTCAAACAGA/3Cy5Sp/ | SEQ ID NO: 31
9 | CGCGAAATCCCCGTAACGAGCGTCCCTTGC/3Cy5Sp/ | SEQ ID NO: 32
10 | GCATGAGTTGCCTGGCGTTGCGACGACTAA/3Cy5Sp/ | SEQ ID NO: 33
11 | CCGTCGTCTCCGGTCCACCGTTGCGCTTAC/3Cy5Sp/ | SEQ ID NO: 34
12 | GGCCAATGGCCCAGGTCCGTCACGCAATTT/3Cy5Sp/ | SEQ ID NO: 35
13 | TTGATCGAATCGGAGCGTAGCGGAATCTGC/3Cy5Sp/ | SEQ ID NO: 36
14 | CGCGCGGATCCGCTTGTCGGGAACGGATAC/3Cy5Sp/ | SEQ ID NO: 37
15 | GCCTCGATTACGACGGATGTAATTCGGCCG/3Cy5Sp/ | SEQ ID NO: 38
16 | GCCCGTATTCCCGCTTGCGAGTAGGGCAAT/3Cy5Sp/ | SEQ ID NO: 39

The readout probes used for the 140-gene libraries were probes 1 through 16. The readout probes used for the 1001-gene experiment were probes 1 through 14. /3Cy5Sp/ indicates a 3′ Cy5 modification.

To design the central targeting sequences of the encoding probes, the abundance of different transcripts in IMR90 cells was compiled using Cufflinks v2.1, total RNA data from the ENCODE project, and human genome annotations from Gencode v18. Probes were designed from gene models corresponding to the most abundant isoform using OligoArray2.1 with the following constraints: the target sequence region is 30 nt long; the melting temperature of the hybridized region of the probe and cellular RNA target is greater than 70° C.; there are no cross-hybridization targets with melting temperatures greater than 72° C.; there are no predicted internal secondary structures with melting temperatures greater than 76° C.; and there are no contiguous repeats of 6 or more identical nucleotides. Melting temperatures were adjusted to optimize the specificity of these probes and minimize secondary structure while still producing sufficient numbers of probes for the libraries. To decrease computational cost, isoforms were divided into 1-kb regions for probe design. Using BLAST+, all potential probes that mapped to more than one cellular RNA species were rejected. Probes with multiple targets on the same RNA were kept.

For each gene in the 140-gene experiments, 198 putative encoding probe sequences were generated by concatenating the appropriate index primers, readout sequences, and targeting regions as shown in FIG. 13. To address the possibility that concatenation of these sequences introduced new regions of homology to off-target RNAs, BLAST+ was used to screen these putative sequences against all human rRNA and tRNA sequences as well as highly expressed genes (genes with FPKM>10,000). Probes with greater than 14 nt of homology to rRNAs or tRNAs or greater than 17 nt of homology to highly expressed genes were removed. After these cuts, there were ~192 (with a standard deviation of 2) probes per gene for both MHD4 codebooks used in the 140-gene experiments. The same protocol was used for the 1001-gene experiments: starting with 96 putative targeting sequences per gene, ~94 (with a standard deviation of 6) encoding probes per gene were obtained after these additional homology cuts. The number of encoding probes per RNA was decreased for the 1001-gene experiments so that these probes could be synthesized from a single 100,000-member oligopool as opposed to two separate pools. Each encoding probe was designed to contain two of the four readout sequences associated with each code word; hence only half of the bound encoding probes can bind a readout probe during any given hybridization round. ~192 or ~94 encoding probes per RNA were used to obtain high signal-to-background ratios for individual RNA molecules. The number of encoding probes per RNA could be substantially reduced while still allowing single RNA molecules to be identified. In addition, increasing the number of readout sequences per encoding probe or using optical sectioning methods to reduce the fluorescence background may allow further reduction in the number of encoding probes per RNA.

Two types of misidentification controls were designed. The first type of control (blank words) was not represented with encoding probes. The second type of control (no-target words) had encoding probes that did not target any cellular RNA. The targeting regions of these probes were composed of random nucleotide sequences subject to the same constraints used to design the RNA targeting sequences described above. Moreover, these random sequences were screened against the human transcriptome to ensure that they contain no significant homology (>14 nt) to any human RNA. The 140-gene measurements contained 5 blank words and 5 no-target words. The 1001-gene measurements contained 11 blank words and 5 no-target words.

Probe synthesis. The encoding probes were synthesized using the following steps, and this synthesis protocol is illustrated in FIG. 13.

Step 1: The template oligopool (CustomArray) was amplified via limited-cycle PCR on a Bio-Rad CFX96 using primer sequences specific to the desired probe set. To facilitate subsequent amplification via in vitro transcription, the reverse primer contained the T7 promoter. All primers were synthesized by IDT. This reaction was column purified (Zymo DNA Clean and Concentrator; D4003).

Step 2: The purified PCR products were then further amplified ~200-fold and converted into RNA via a high-yield in vitro transcription according to the manufacturer's instructions (New England Biolabs, E2040S). Each 20 microliter reaction contained ~1 microgram of template DNA from above, 10 mM of each NTP, 1× reaction buffer, 1× RNase inhibitor (Promega RNasin, N2611) and 2 microliters of the T7 polymerase. This reaction was incubated at 37° C. for 4 hours to maximize yield. This reaction was not purified before the following steps.

Step 3: The RNA products from the above in vitro transcription reaction were then converted back into DNA via a reverse transcription reaction. Each 50 microliter reaction contained the unpurified RNA product from Step 2 supplemented with 1.6 mM of each dNTP, 2 nmol of a reverse transcription primer, 300 units of Maxima H- reverse transcriptase (Thermo Scientific, EP0751), 60 units of RNasin, and a final 1× concentration of the Maxima RT buffer. This reaction was incubated at 50° C. for 45 minutes, and the reverse transcriptase was inactivated at 85° C. for 5 minutes. The templates for the 140-gene libraries contain a common priming region for this reverse transcription step; thus, a single primer was used for this step when creating these probes. Its sequence was CGGGTTTAGCGCCGGAAATG (SEQ ID NO: 40). A common priming region was not included for the 1001-gene library; thus, the reverse transcription was conducted with the forward primer:

(SEQ ID NO: 20) CGCGGGCTATATGCGAACCG.

Step 4: To remove the template RNA, 20 microliters of 0.25 M EDTA and 0.5 N NaOH was added to the above reaction to selectively hydrolyze RNA, and the sample was incubated at 95° C. for 10 minutes. This reaction was then immediately purified by column purification using a 100-microgram-capacity column (Zymo Research, D4030) and the Zymo Oligo Clean and Concentrator protocol. The final probes were eluted in 100 microliters of RNase-free deionized water, evaporated in a vacuum concentrator, and then resuspended in 10 microliters of encoding hybridization buffer (see below). Probes were stored at −20° C. Denaturing polyacrylamide gel electrophoresis and absorption spectroscopy were used to confirm the quality of the probes and revealed that this probe synthesis protocol converts 90-100% of the reverse-transcription primer into full-length probe and that, of the probe that is constructed, 70-80% is recovered during the purification step.

Fluorescently labeled readout probes have sequences complementary to the readout sequences described above and a Cy5 dye attached at the 3′ end. These probes were obtained from IDT and HPLC purified.

Sample preparation and labeling with encoding probes. Human primary fibroblasts (American Type Culture Collection, IMR90) were used in this work. These cells are relatively large and flat, facilitating wide-field imaging without the need for optical sectioning. Cells were cultured with Eagle's Minimum Essential Medium. Cells were plated on 22-mm, #1.5 coverslips (Bioptechs, 0420-0323-2) at 350,000 cells/coverslip and incubated at 37° C. with 5% CO₂ for 48-96 hours within petri dishes. Cells were fixed for 20 minutes in 4% paraformaldehyde (Electron Microscopy Sciences, 15714) in 1× phosphate buffered saline (PBS; Ambion, AM9625) at room temperature, reduced for 5 minutes with 0.1% w/v sodium borohydride (Sigma, 480886) in water to reduce background fluorescence, washed three times with ice-cold 1×PBS, permeabilized for 2 minutes with 0.5% v/v Triton (Sigma, T8787) in 1×PBS at room temperature, and washed three times with ice-cold 1×PBS.

Cells were incubated for 5 minutes in encoding wash buffer comprising 2× saline-sodium citrate buffer (SSC) (Ambion, AM9763), 30% v/v formamide (Ambion, AM9342), and 2 mM vanadyl ribonucleoside complex (NEB, S1402S). 10 microliters of 100 micromolar (140-gene experiments) or 200 micromolar (1001-gene experiments) encoding probes in encoding hybridization buffer was added to the cell-containing coverslip and spread uniformly by placing another coverslip on top of the sample. Samples were then incubated in a humid chamber inside a 37° C. hybridization oven for 18-36 hours. Encoding hybridization buffer is composed of encoding wash buffer supplemented with 1 mg/mL yeast tRNA (Life Technologies, 15401-011) and 10% w/v dextran sulfate (Sigma, D8906-50G).

Cells were then washed with primary encoding wash buffer, incubated at 47° C. for 10 minutes, and this wash was repeated for a total of three times. A 1:1000 dilution of 0.2-micrometer-diameter carboxylate-modified orange fluorescent beads (Life Technologies, F-8809) in 2×SSC was sonicated for 3 minutes and then incubated with the sample for 5 minutes. The beads were used as fiducial markers to align images obtained from multiple successive rounds of hybridization, as described below. The sample was washed once with 2×SSC, and then post-fixed with 4% v/v paraformaldehyde in 2×SSC at room temperature for 30 minutes. The sample was then washed three times with 2×SSC and either imaged immediately or stored for no longer than 12 hours at 4° C. prior to imaging. All solutions were prepared as RNase-free.

MERFISH imaging with multiple successive rounds of hybridization. The sample coverslip was assembled into a Bioptechs FCS2 flow chamber, and the flow through this chamber was controlled via a home-built fluidics system composed of three computer-controlled 8-way valves (Hamilton, MVP and HVXM 8-5) and a computer-controlled peristaltic pump (Rainin, Dynamax RP-1). The sample was imaged on a home-built microscope constructed around an Olympus IX-71 body and a 1.45 NA, 100× oil immersion objective and configured for oblique incidence excitation. The objective was heated to 37° C. with a Bioptechs objective heater. Constant focus was maintained throughout the imaging process with a home-built auto-focusing system. Illumination was provided at 641 nm, 561 nm, and 405 nm using solid state lasers (MPB communications, VFL-P500-642; Coherent, 561-200CWCDRH; and Coherent, 1069413/AT) for excitation of the Cy5-labeled readout probes, the fiducial beads, and nuclear counterstains, respectively. These lines were combined with a custom dichroic (Chroma, zy405/488/561/647n52RP-UF1) and the emission was filtered with a custom dichroic (Chroma, ZET405/488/561/647-656/752m). Fluorescence was separated with a QuadView (Photometrics) using the dichroics T560lpxr, T650lpxr, 750dcxxr (Chroma) and the emission filters ET525/50m, WT59550m-2f, ET700/75m, HQ770lp (Chroma) and imaged with an EMCCD camera (Andor, iXon-897). The camera was configured so that a pixel corresponds to 167 nm in the sample plane. The entire system was fully automated, so that imaging and fluid handling were performed for the entire experiment without user intervention.

Sequential hybridization, imaging, and bleaching proceeded as follows. 1 mL of 10 nM of the appropriate fluorescently labeled readout probe in readout hybridization buffer (2×SSC, 10% v/v formamide, 10% w/v dextran sulfate, and 2 mM vanadyl ribonucleoside complex) was flowed across the sample, flow was stopped, and the sample was incubated for 15 minutes. Then 2 mL of readout wash buffer (2×SSC, 20% v/v formamide, and 2 mM vanadyl ribonucleoside complex) was flowed across the sample, flow was stopped, and the sample was incubated for 3 minutes. 2 mL of imaging buffer comprising 2×SSC, 50 mM TrisHCl pH 8, 10% w/v glucose, 2 mM Trolox (Sigma-Aldrich, 238813), 0.5 mg/mL glucose oxidase (Sigma-Aldrich, G2133), and 40 microgram/mL catalase (Sigma-Aldrich, C30) was flowed across the sample. Flow was then stopped, and approximately 75 to 100 regions were exposed to ~25 mW of 642-nm and 1 mW of 561-nm light and imaged. Each region was 40 micrometers by 40 micrometers. The laser powers were measured at the microscope backport. Because the imaging buffer is sensitive to oxygen, the ~50 mL of imaging buffer used for a single experiment was made fresh at the beginning of the experiment and then stored under a layer of mineral oil throughout the measurement. Buffer stored in this fashion was stable for more than 24 hours.

After imaging, the fluorescence of the readout probes was extinguished via photobleaching. The sample was washed with 2 mL of photobleaching buffer (2×SSC and 2 mM vanadyl ribonucleoside complex), and each imaged region of the sample was exposed to 200 mW of 641-nm light for 3 s. To confirm the efficacy of this photobleaching treatment, imaging buffer was reintroduced, and the sample was imaged as described above.

The above hybridization, imaging, and photobleaching process was repeated either 16 times for the 140-gene measurements using the MHD4 code or 14 times for the 1001-gene measurements using the MHD2 code. An entire experiment was typically completed in ~20 hours.

Following completion of imaging, 2 mL of a 1:1000 dilution of Hoechst (ENZ-52401) in 2×SSC was flowed through the chamber to label the nuclei of the cells. The sample was then washed immediately with 2 mL of 2×SSC followed by 2 mL of imaging buffer. Each region of the sample was then imaged once again with ~1 mW of 405-nm light.

Because cells were imaged using wide-field imaging with oblique-incidence illumination, without optical sectioning and z-scanning, the fraction of individual RNA species that was outside the axial range of the imaging geometry was quantified for 6 different RNA species using conventional smFISH. For this purpose, these cells were optically sectioned by collecting stacks of images at different focal depths through the entire depth of the cells. The images in consecutive focal planes were aligned, and the fraction of RNAs that were detected in the three-dimensional stack but not in the basal focal plane was then computed for each cell. It was found that only a small fraction, 15%+/−1% (mean+/−SEM across six different RNA species), of RNA molecules were outside the imaging range of a fixed focal plane without z-scanning. These measurements also confirmed that the excitation geometry illuminated the full depth of the cells. Any optical sectioning technique could be employed in MERFISH to allow the imaging of RNAs in thicker cells or tissues.

Construction of measured words. Fluorescent spots were identified and localized in each image using a multi-Gaussian-fitting algorithm assuming a Gaussian with a uniform width of 167 nm. This algorithm was used to allow partially overlapping spots to be distinguished and individually fit. RNA spots were distinguished from background signal, i.e., signal arising from probes bound non-specifically, by setting the intensity threshold required to fit a spot with this software. Due to variation in the brightness of spots between rounds of hybridization, this threshold was adjusted appropriately for each hybridization round to minimize the combined average of the 1-->0 and 0-->1 error rates across all hybridization rounds (140-gene measurements) or to maximize the ratio of the number of measured words with four ‘1’ bits to those with three or five ‘1’ bits (1001-gene measurements). The location of the fiducial beads was identified in each frame using a faster single-Gaussian fitting algorithm.

Images of the same sample region in different rounds of hybridization were registered by rotating and translating the image to align the two fiducial beads within the same image that were most similar in location after a coarse initial alignment via image correlation. All images were aligned to a coordinate system established by the images collected in the first round of hybridization. The quality of this alignment was determined from the residual distance between five additional fiducial beads, and alignment error was typically ~20 nm.
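The rotation-plus-translation registration from two fiducial beads can be sketched as follows. This is a minimal illustration, assuming the two beads have already been matched between rounds and their positions are given as (x, y) coordinates; the function names and the midpoint-based choice of translation are assumptions made for this example, not details taken from the analysis software used.

```python
import math

def rigid_transform_from_two_beads(src, ref):
    """Rotation angle (radians) and translation mapping two bead positions in
    a later round (src) onto the same beads in the first round (ref).
    src and ref are each [(x0, y0), (x1, y1)]."""
    (sx0, sy0), (sx1, sy1) = src
    (rx0, ry0), (rx1, ry1) = ref
    # Angle between the bead-to-bead vectors in the two images
    theta = (math.atan2(ry1 - ry0, rx1 - rx0)
             - math.atan2(sy1 - sy0, sx1 - sx0))
    c, s = math.cos(theta), math.sin(theta)
    # Translation that moves the rotated src midpoint onto the ref midpoint
    smx, smy = (sx0 + sx1) / 2, (sy0 + sy1) / 2
    rmx, rmy = (rx0 + rx1) / 2, (ry0 + ry1) / 2
    shift = (rmx - (c * smx - s * smy), rmy - (s * smx + c * smy))
    return theta, shift

def apply_transform(point, theta, shift):
    """Rotate a point by theta about the origin, then translate it."""
    x, y = point
    c, s = math.cos(theta), math.sin(theta)
    return (c * x - s * y + shift[0], s * x + c * y + shift[1])

def mean_residual(src_points, ref_points, theta, shift):
    """Mean distance between transformed beads and their reference positions."""
    dists = [math.dist(apply_transform(p, theta, shift), q)
             for p, q in zip(src_points, ref_points)]
    return sum(dists) / len(dists)
```

In use, the transform would be estimated from the two matched beads and `mean_residual` evaluated on additional beads that were not used in the fit, mirroring the quality check described above.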

Fluorescence spots in different hybridization rounds were connected into a single string, corresponding to a potential RNA molecule, if the distance between spots was smaller than 1 pixel (167 nm). For each string of spots, the on-off sequence of fluorescent signals in all hybridization rounds was used to assign a binary word to the potential RNA molecule, in which ‘1’ was assigned to the hybridization rounds that contained a fluorescent signal above threshold and ‘0’ was assigned to the other hybridization rounds. Measured words were then decoded into RNA species using the 16-bit MHD4 code or the 14-bit MHD2 code discussed above. In the case of the 16-bit MHD4 code, if the measured binary word matched the code word of a specific RNA perfectly or differed from the code word by one single bit, it was assigned to that RNA. In the case of the 14-bit MHD2 code, the measured binary word was assigned to an RNA only if it matched the code word of that RNA perfectly. To determine the copy number per cell, the number of each RNA species was counted in individual cells within each 40 micrometer by 40 micrometer imaging area. It is noted that this number accounts for the majority but not all RNA molecules within a cell because a fraction of the cell could be outside the imaging area or focal depth. Tiling images of adjacent areas and adjacent focal planes could be employed to improve the counting accuracy.
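The two decoding rules described above can be sketched as a single function with a switch for single-bit error correction. This is a minimal illustration: the function names, the 8-bit toy codebook, and the RNA names are hypothetical and are not the 16-bit MHD4 or 14-bit MHD2 codebooks used in the experiments.

```python
def hamming_distance(a, b):
    """Number of bit positions at which two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))

def decode_word(measured, codebook, correct_single_errors):
    """Return the RNA assigned to a measured binary word, or None.

    codebook maps code-word strings to RNA names. With
    correct_single_errors=True (MHD4-style), a word that differs from exactly
    one code word by a single bit is assigned to that code word. With False
    (MHD2-style), only perfect matches are accepted."""
    if measured in codebook:
        return codebook[measured]
    if correct_single_errors:
        matches = [name for word, name in codebook.items()
                   if hamming_distance(measured, word) == 1]
        if len(matches) == 1:
            return matches[0]
    return None

# Hypothetical toy codebook whose words are pairwise at Hamming distance >= 4
toy_codebook = {
    "11110000": "RNA_A",
    "00001111": "RNA_B",
    "11001100": "RNA_C",
}
print(decode_word("11110000", toy_codebook, True))   # exact match
print(decode_word("11100000", toy_codebook, True))   # one 1-->0 error, corrected
print(decode_word("11100000", toy_codebook, False))  # same word, detection only
print(decode_word("11000000", toy_codebook, True))   # two errors, unidentified
```

Because code words at Hamming distance 4 or more cannot share a neighbor that is one bit away from both, a single-bit error maps back to at most one code word, which is what makes the correction step unambiguous.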

In the 140-gene experiments, some regions of the cell nucleus occasionally contained too much fluorescence signal to properly identify individual RNA spots. In the 1001-gene experiments, the cell nucleus generally contained too much fluorescent signal to allow identification of individual RNA molecules. These bright regions were excluded from all subsequent analysis. This work focuses on mRNAs, which are enriched in the cytoplasm. To estimate the fraction of mRNAs missed by excluding the nucleus region, conventional smFISH was used to quantify the fraction of molecules found inside the nucleus for six different mRNA species. It was found that only 5%+/−2% (mean+/−SEM across six RNA species) of these RNA molecules are found in the nucleus. Employment of super-resolution imaging and/or optical sectioning could potentially allow individual molecules in these dense nucleus regions to be identified, which would be particularly useful for probing those non-coding RNAs that are enriched in the nucleus.

smFISH measurements of individual genes. Pools of 48 fluorescently labeled (Quasar 670) oligonucleotide probes per RNA were purchased from Biosearch Technologies. 30-nt probe sequences were taken directly from a random subset of the targeting regions used for the multiplexed measurements. Cells were fixed and permeabilized as described above. 10 microliters of 250 nM oligonucleotide probes in encoding hybridization buffer (described above) was added to the cell-containing coverslip and spread uniformly by placing another coverslip on top of the sample. Samples were then incubated in a humid chamber inside a 37° C. hybridization oven for 18 hours. Cells were then washed with encoding wash buffer (described above) at 37° C. for 10 minutes, and this wash was repeated for a total of three times. The sample was then washed three times with 2×SSC and imaged in imaging buffer using the same imaging geometry as described above for MERFISH.

Bulk RNA sequencing. Total RNA was extracted from IMR90 cells cultured as above using the Zymo Quick RNA MiniPrep kit (R1054) according to the manufacturer's instructions. PolyA RNA was then selected (NEB; E7490), and a sequencing library was constructed using the NEBNext Ultra RNA library preparation kit (NEB; E7530) and amplified with custom oligonucleotides, and 150-bp reads were obtained on a MiSeq. These sequences were aligned to the human genome (Gencode v18) and isoform abundance was computed with Cufflinks.

Calculation of the predicted scaling and error properties of different encoding schemes. Analytic expressions were derived for the dependence of the number of possible code words, the calling rate, and the misidentification rate on the number of bits, N. The calling rate is defined as the fraction of RNA molecules that are properly identified. The misidentification rate is defined as the fraction of RNA molecules that are misidentified as a wrong RNA species. For encoding schemes with an error-detection capability, the calling rate and misidentification rate do not add up to 1 because a fraction of the molecules not called properly can be detected as errors and discarded and, hence, not misidentified as a wrong species. These calculations assume that the probability of misreading bits is constant for all hybridization rounds but differs for the 1-->0 and 0-->1 errors. Experimentally measured average 1-->0 and 0-->1 error rates (10% and 4%, respectively) were used for the estimates shown in FIGS. 5B-5D. For simplicity, the word corresponding to all ‘0’s was not removed from calculations.

For the simple binary encoding scheme in which all possible N-bit binary words are assigned to unique RNA species, the number of possible code words is 2^(N). The number of words that could be used to encode RNA is actually 2^(N)−1 because the code word ‘00 . . . 0’ does not contain detectable fluorescence in any hybridization round, but for simplicity the word corresponding to all ‘0’s was not removed from subsequent calculations. The error introduced by this approximation is negligible. For any given word with m ‘1’s and N−m ‘0’s, the probability of measuring that word without error (the fraction of RNAs that is properly called) is:

$\begin{matrix}{\left( {1 - p_{1}} \right)^{m}\left( {1 - p_{0}} \right)^{N - m},} & (1)\end{matrix}$

where p₁ is the 1-->0 error rate and p₀ is the 0-->1 error rate per bit. Because different words in this simple binary encoding scheme can have different numbers of ‘1’ bits, the calling rate for different words will differ if p₁≠p₀. The average calling rate, reported in FIG. 5C, was determined from the weighted average of the value of Eq. (1) for all words. This weighted average is:

$\begin{matrix}{{\frac{1}{2^{N}}{\sum\limits_{m = 0}^{N}{\begin{pmatrix}N \\m\end{pmatrix}\left( {1 - p_{1}} \right)^{m}\left( {1 - p_{0}} \right)^{N - m}}}},} & (2)\end{matrix}$

where

$\begin{pmatrix}N \\m\end{pmatrix}$

is the binomial coefficient and corresponds to the number of words with m ‘1’ bits in this encoding scheme. Since in this encoding scheme every error produces a binary word that encodes a different RNA, the average misidentification rate for this encoding scheme, reported in FIG. 5D, follows directly from Eq. (2):

$\begin{matrix}{1 - {\frac{1}{2^{N}}{\sum\limits_{m = 0}^{N}{\begin{pmatrix}N \\m\end{pmatrix}\left( {1 - p_{1}} \right)^{m}{\left( {1 - p_{0}} \right)^{N - m}.}}}}} & (3)\end{matrix}$
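As a non-limiting illustration, Eqs. (2) and (3) can be evaluated numerically as sketched below. The function names are hypothetical and chosen only for this example; the error rates used in the usage lines are the measured averages quoted above (10% and 4%).

```python
from math import comb


def binary_calling_rate(n_bits, p1, p0):
    """Average calling rate of the simple binary code, Eq. (2).

    p1 is the 1->0 error rate and p0 is the 0->1 error rate per bit.
    """
    total = sum(
        comb(n_bits, m) * (1 - p1) ** m * (1 - p0) ** (n_bits - m)
        for m in range(n_bits + 1)
    )
    return total / 2 ** n_bits


def binary_misidentification_rate(n_bits, p1, p0):
    """Average misidentification rate of the simple binary code, Eq. (3)."""
    return 1.0 - binary_calling_rate(n_bits, p1, p0)


if __name__ == "__main__":
    for n in (4, 8, 16):
        print(n, 2 ** n,
              binary_calling_rate(n, 0.10, 0.04),
              binary_misidentification_rate(n, 0.10, 0.04))
```

By the binomial theorem, the sum in Eq. (2) collapses to (1 − (p₁+p₀)/2)^(N), which provides a convenient check on the numerical result.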

To calculate the scaling and error properties of the extended Hamming distance 4 (HD4) code, the generator matrix for the desired number of data bits was first created using standard methods. The generator matrix determines the specific words that are present in a given encoding scheme and was used to directly determine the number of encoded words as a function of the number of bits. In this encoding scheme, the calling rate corresponds to the fraction of words measured without error as well as the fraction of words measured with a single-bit error. For code words with m ‘1’ bits, this fraction is determined by the following expression:

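$\begin{matrix}{{\left( {1 - p_{1}} \right)^{m}\left( {1 - p_{0}} \right)^{N - m}} + {mp_{1}^{1}\left( {1 - p_{1}} \right)^{m - 1}\left( {1 - p_{0}} \right)^{N - m}} + {\left( {N - m} \right)p_{0}^{1}\left( {1 - p_{1}} \right)^{m}\left( {1 - p_{0}} \right)^{N - m - 1}},} & (4)\end{matrix}$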

where the first term is the probability of not making any errors, the second term corresponds to the total probability of making one 1-->0 error at any of the m ‘1’ bits without making any other 0-->1 errors, and the final term corresponds to the total probability of making one 0-->1 error at any of the N−m ‘0’ bits without making any 1-->0 errors. Because the number of ‘1’ bits can differ between words in this encoding scheme, the average calling rate reported in FIG. 5C was computed from a weighted average over Eq. (4) for different values of m. The weight for each term was determined from the number of words that contain m ‘1’ bits as determined from the generator matrix described above.
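The weighted average over Eq. (4) can be sketched as follows. This is an illustrative example only: the code words are enumerated from parity checks (even number of ‘1’ bits and a zero XOR of the positions of the ‘1’ bits), which is assumed here as one standard construction of an extended Hamming code of length 2^r and stands in for the generator-matrix approach described above. All function names are hypothetical.

```python
from collections import Counter


def extended_hamming_words(n_bits=16):
    """Enumerate an extended Hamming code of length n_bits (a power of 2).

    Assumption: a word is kept when it has an even number of '1' bits and
    the XOR of the positions of its '1' bits is zero (parity-check form).
    Words are returned as integers whose binary digits are the bits.
    """
    words = []
    for w in range(2 ** n_bits):
        syndrome, ones = 0, 0
        for pos in range(n_bits):
            if (w >> pos) & 1:
                syndrome ^= pos
                ones += 1
        if syndrome == 0 and ones % 2 == 0:
            words.append(w)
    return words


def hd4_calling_rate(n_bits, m, p1, p0):
    """Eq. (4): exact match or one corrected error for a word with m '1' bits."""
    no_error = (1 - p1) ** m * (1 - p0) ** (n_bits - m)
    one_to_zero = 0.0
    if m > 0:
        one_to_zero = m * p1 * (1 - p1) ** (m - 1) * (1 - p0) ** (n_bits - m)
    zero_to_one = 0.0
    if m < n_bits:
        zero_to_one = ((n_bits - m) * p0 * (1 - p1) ** m
                       * (1 - p0) ** (n_bits - m - 1))
    return no_error + one_to_zero + zero_to_one


def average_hd4_calling_rate(n_bits, p1, p0):
    """Average of Eq. (4) weighted by the number of words with m '1' bits."""
    weights = Counter(bin(w).count("1") for w in extended_hamming_words(n_bits))
    n_words = sum(weights.values())
    return sum(count * hd4_calling_rate(n_bits, m, p1, p0)
               for m, count in weights.items()) / n_words


if __name__ == "__main__":
    print(len(extended_hamming_words(16)))
    print(average_hd4_calling_rate(16, 0.10, 0.04))
```

The brute-force enumeration is practical for 16 bits; for longer codes a generator matrix, as described above, would be the more economical route.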

Because RNA-encoding words are separated by a minimum Hamming distance of 4, at least 4 errors are required to switch one word into another. If error correction is applied, then 3 or 5 errors could also convert one RNA into another. Thus, the misidentification rate from all possible combinations of 3-bit, 4-bit and 5-bit errors was estimated for code words with m ‘1’ bits. Technically, >5-bit errors could also convert one RNA into another, but the probability of making such errors is negligible because of the small per-bit error rate. This expression was approximated with:

$\begin{matrix}{{\sum\limits_{i = 0}^{4}{\begin{pmatrix}m \\i\end{pmatrix}\begin{pmatrix}{N - m} \\{4 - i}\end{pmatrix}p_{1}^{i}p_{0}^{4 - i}\left( {1 - p_{1}} \right)^{m - i}\left( {1 - p_{0}} \right)^{N - m - {({4 - i})}}}} + {\sum\limits_{i = 0}^{3}{\begin{pmatrix}m \\i\end{pmatrix}\begin{pmatrix}{N - m} \\{3 - i}\end{pmatrix}p_{1}^{i}p_{0}^{3 - i}\left( {1 - p_{1}} \right)^{m - i}\left( {1 - p_{0}} \right)^{N - m - {({3 - i})}}}} + {\sum\limits_{i = 0}^{5}{\begin{pmatrix}m \\i\end{pmatrix}\begin{pmatrix}{N - m} \\{5 - i}\end{pmatrix}p_{1}^{i}p_{0}^{5 - i}\left( {1 - p_{1}} \right)^{m - i}\left( {1 - p_{0}} \right)^{N - m - {({5 - i})}}}}.} & (5)\end{matrix}$

The first sum corresponds to all of the ways in which exactly four mistakes can be made. Similarly, the second and third sums correspond to all of the ways in which exactly three or five mistakes can be made. Eq. (5) provides an upper bound for the misidentification rate because not all three, four, or five bit errors produce a word that matches or would be corrected to another legitimate word. Again, because the number of ‘1’ bits can differ between words, the average misidentification rate reported in FIG. 5D is calculated as a weighted average of Eq. (5) over the number of words that have m ‘1’ bits.
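Eq. (5) can be evaluated with a short routine such as the one below, offered only as a sketch. Here i counts the 1-->0 errors and k−i counts the 0-->1 errors among exactly k errors; the function names are hypothetical.

```python
from math import comb


def exactly_k_errors(n_bits, m, k, p1, p0):
    """Probability of exactly k bit errors in a word with m '1' bits."""
    total = 0.0
    for i in range(k + 1):      # i = number of 1->0 errors
        j = k - i               # j = number of 0->1 errors
        if i > m or j > n_bits - m:
            continue            # not enough '1' or '0' bits for this split
        total += (comb(m, i) * comb(n_bits - m, j)
                  * p1 ** i * p0 ** j
                  * (1 - p1) ** (m - i)
                  * (1 - p0) ** (n_bits - m - j))
    return total


def hd4_misidentification_bound(n_bits, m, p1, p0):
    """Eq. (5): upper bound from all three-, four- and five-bit errors."""
    return sum(exactly_k_errors(n_bits, m, k, p1, p0) for k in (3, 4, 5))


if __name__ == "__main__":
    for m in (4, 6, 8):
        print(m, hd4_misidentification_bound(16, m, 0.10, 0.04))
```

The average over a code is then the sum of these per-m values weighted by the number of code words that have m ‘1’ bits.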

To generate the MHD4 code where the number of ‘1’ bits for each code word is set to 4, the HD4 codes were first generated as described above, and then all code words that did not contain four ‘1’s were removed. The calling rate of this code, reported in FIG. 5C, was directly calculated from Eq. (4) but with m=4 because all code words in this code have four ‘1’ bits. The misidentification rate of this code, reported in FIG. 5D, was calculated by modifying Eq. (5) with the following considerations: (i) the number of ‘1’ bits, m, was set to 4 and (ii) errors that produce words that do not contain three, four, or five ‘1’ bits were excluded. Thus, the expression in Eq. (5) was simplified to

$\begin{matrix}{{\begin{pmatrix}4 \\2\end{pmatrix}\begin{pmatrix}{N - 4} \\2\end{pmatrix}p_{1}^{2}p_{0}^{2}\left( {1 - p_{1}} \right)^{2}\left( {1 - p_{0}} \right)^{N - 6}} + {\begin{pmatrix}4 \\1\end{pmatrix}\begin{pmatrix}{N - 4} \\2\end{pmatrix}p_{1}^{1}p_{0}^{2}\left( {1 - p_{1}} \right)^{3}\left( {1 - p_{0}} \right)^{N - 6}} + {\begin{pmatrix}4 \\2\end{pmatrix}\begin{pmatrix}{N - 4} \\1\end{pmatrix}p_{1}^{2}p_{0}^{1}\left( {1 - p_{1}} \right)^{2}\left( {1 - p_{0}} \right)^{N - 5}} + {\begin{pmatrix}4 \\2\end{pmatrix}\begin{pmatrix}{N - 4} \\3\end{pmatrix}p_{1}^{2}p_{0}^{3}\left( {1 - p_{1}} \right)^{2}\left( {1 - p_{0}} \right)^{N - 7}} + {\begin{pmatrix}4 \\3\end{pmatrix}\begin{pmatrix}{N - 4} \\2\end{pmatrix}p_{1}^{3}p_{0}^{2}\left( {1 - p_{1}} \right)^{1}\left( {1 - p_{0}} \right)^{N - 6}}.} & (6)\end{matrix}$

Again, this expression is an upper bound on the actual misidentification rate because not all words with four ‘1’s are valid code words.
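A minimal sketch of the MHD4 construction and its error estimates follows. It assumes, for illustration only, that the 16-bit HD4 code is the extended Hamming code in which the positions of the ‘1’ bits XOR to zero, so that keeping the words with four ‘1’ bits reduces to a search over four-position combinations. Each term of the bound takes i 1-->0 errors and j 0-->1 errors with the exponents that Eq. (5) assigns when m=4. Function names are hypothetical.

```python
from itertools import combinations
from math import comb


def mhd4_words(n_bits=16):
    """Positions of the '1' bits for each code word with exactly four '1's.

    Assumption: the four positions must XOR to zero, which selects the
    weight-4 words of one extended Hamming code (n_bits a power of 2).
    """
    return [pos for pos in combinations(range(n_bits), 4)
            if pos[0] ^ pos[1] ^ pos[2] ^ pos[3] == 0]


def mhd4_calling_rate(n_bits, p1, p0):
    """Eq. (4) evaluated with m = 4."""
    m = 4
    return ((1 - p1) ** m * (1 - p0) ** (n_bits - m)
            + m * p1 * (1 - p1) ** (m - 1) * (1 - p0) ** (n_bits - m)
            + (n_bits - m) * p0 * (1 - p1) ** m
            * (1 - p0) ** (n_bits - m - 1))


def mhd4_misidentification_bound(n_bits, p1, p0):
    """Eq. (5) with m = 4, keeping only errors that leave 3, 4 or 5 '1' bits."""
    total = 0.0
    # (i, j) = (number of 1->0 errors, number of 0->1 errors)
    for i, j in ((2, 2), (1, 2), (2, 1), (2, 3), (3, 2)):
        total += (comb(4, i) * comb(n_bits - 4, j)
                  * p1 ** i * p0 ** j
                  * (1 - p1) ** (4 - i)
                  * (1 - p0) ** (n_bits - 4 - j))
    return total


if __name__ == "__main__":
    print(len(mhd4_words(16)))
    print(mhd4_calling_rate(16, 0.10, 0.04))
    print(mhd4_misidentification_bound(16, 0.10, 0.04))
```

Two words with four ‘1’ bits each are at Hamming distance 4 or more exactly when they share at most two ‘1’ positions, which gives a simple way to check the resulting code.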

Estimates of the 1-->0 and 0-->1 error rates for each hybridization round. To compute the probability of misreading a bit at a given hybridization round, the error correcting properties of the MHD4 code were used. Briefly, the probabilities of 1-->0 or 0-->1 errors were derived in the following way. Let the probability of making an error at the ith bit, i.e. ith hybridization round, be p_(i), and the actual number of RNA molecules of the given species be A; then the number of exact matches for this RNA will be

$W_{E} = {A{\prod\limits_{i = 1}^{16}\left( {1 - p_{i}} \right)}}$

and the number of one-bit error corrected matches for this RNA corresponding to errors at the ith bit will be

$W_{i} = {A\frac{p_{i}}{\left( {1 - p_{i}} \right)}{\prod\limits_{j = 1}^{16}{\left( {1 - p_{j}} \right).}}}$

The p_(i) can be directly derived from the ratio:

${W_{i}/W_{E}} = {\frac{p_{i}}{\left( {1 - p_{i}} \right)}.}$

This ratio assumes that the one-bit error-corrected counts were only generated from single-bit errors from the correct word and that multi-error contamination from other RNA words is negligible. Given that the error rate per hybridization round is small and that it takes at least three errors to convert one RNA-encoding word into a word that would be misidentified as another RNA, the above approximation should be a good one.

To compute the average 1-->0 or 0-->1 error probabilities for each of the 16 hybridization rounds, the above approach was used to calculate the per-bit error rates for each bit of every gene. These error rates were sorted based on whether they correspond to a 1-->0 or a 0-->1 error, and the average for each bit, weighted by the number of counts observed for the corresponding gene, was taken.
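The estimate above amounts to solving W_i/W_E = p_i/(1−p_i) for p_i, which gives p_i = r/(1+r) with r = W_i/W_E. A sketch is given below; the data layout, the function names, and the choice of the total count (exact plus corrected) as the per-gene weight are assumptions made only for this example.

```python
def per_bit_error_rates(exact_count, corrected_counts):
    """Solve W_i / W_E = p_i / (1 - p_i) for p_i, one value per bit."""
    if exact_count <= 0:
        raise ValueError("exact_count must be positive")
    rates = []
    for w_i in corrected_counts:
        ratio = w_i / exact_count
        rates.append(ratio / (1.0 + ratio))
    return rates


def average_round_errors(genes, n_bits=16):
    """Count-weighted average 1->0 and 0->1 error rate for each round.

    genes: iterable of (code_word, exact_count, corrected_counts), where
    code_word is a sequence of 0/1 values of length n_bits.
    Returns (rates_1to0, rates_0to1); an entry is None when no gene has
    the relevant bit value in that round.
    """
    sums = {0: [0.0] * n_bits, 1: [0.0] * n_bits}
    weights = {0: [0.0] * n_bits, 1: [0.0] * n_bits}
    for code_word, exact_count, corrected_counts in genes:
        rates = per_bit_error_rates(exact_count, corrected_counts)
        # Assumed weighting: all counts attributed to this gene.
        weight = exact_count + sum(corrected_counts)
        for i, bit in enumerate(code_word):
            sums[bit][i] += weight * rates[i]
            weights[bit][i] += weight

    def finish(bit):
        return [s / w if w > 0 else None
                for s, w in zip(sums[bit], weights[bit])]

    # A '1' bit can only suffer a 1->0 error, a '0' bit only a 0->1 error.
    return finish(1), finish(0)
```

For example, `average_round_errors([((1, 0, 1, 0), 700.0, [80.0, 30.0, 75.0, 28.0])], n_bits=4)` returns 1-->0 estimates for the first and third rounds and 0-->1 estimates for the second and fourth; the counts in this call are made up for illustration.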

Estimates of the calling rate for individual RNA species from actual imaging data. With the estimates of the 1-->0 or 0-->1 error probabilities for each round of hybridization as determined above, it is possible to estimate the calling rate for each RNA based on the specific word used to encode it. Specifically, the fraction of an RNA species that is called correctly is determined by

$\begin{matrix}{{{\prod\limits_{i = 1}^{N}\left( {1 - p_{i}} \right)} + {\sum\limits_{j = 1}^{N}{\frac{p_{j}}{\left( {1 - p_{j}} \right)}{\prod\limits_{i = 1}^{N}\left( {1 - p_{i}} \right)}}}},} & (7)\end{matrix}$

where the first term represents the probability of observing an exact match of the code word and the second term represents the probability of observing an error-corrected match (i.e. with one-bit error). The values of the per-bit error rate p_(i) for each RNA species are determined by the specific code word for that RNA and the measured 1-->0 or 0-->1 error rates for each round of hybridization. If the code word of the RNA contains a ‘1’ in the ith bit, then p_(i) is determined from the 1-->0 error rate for the ith hybridization round; if the word contains a ‘0’ in the ith bit, p_(i) is determined from the 0-->1 error rate for the ith hybridization round.
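Eq. (7) can be evaluated per code word as sketched below. The function name and argument layout are hypothetical; the per-round error rates are supplied as two lists, one for 1-->0 errors and one for 0-->1 errors, and each rate is assumed to be less than 1.

```python
from math import prod


def calling_rate(code_word, rates_1to0, rates_0to1):
    """Eq. (7): probability of an exact or one-bit error-corrected match.

    code_word is a sequence of 0/1 values; rates_1to0[i] and rates_0to1[i]
    are the error rates of hybridization round i (each assumed < 1).
    """
    # Pick the error rate that applies to each bit of this code word.
    p = [rates_1to0[i] if bit == 1 else rates_0to1[i]
         for i, bit in enumerate(code_word)]
    exact = prod(1 - p_i for p_i in p)
    corrected = sum(p_j / (1 - p_j) for p_j in p) * exact
    return exact + corrected


if __name__ == "__main__":
    word = [1, 1, 1, 1] + [0] * 12
    print(calling_rate(word, [0.10] * 16, [0.04] * 16))
```

When the error rates are the same in every round, this reduces to Eq. (4) with m equal to the number of ‘1’ bits in the code word.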

Hierarchical clustering analysis of the co-variation in RNA abundance. Hierarchical clustering of the co-variation in gene expression for both the 140-gene and 1001-gene experiments was conducted as follows. First, the distance between every pair of genes was determined as 1 minus the Pearson correlation coefficient of the cell-to-cell variation of the measured copy numbers of these two RNA species, both normalized by the total RNA counted in the cell. Thus, highly correlated genes are ‘closer’ to one another and highly anti-correlated genes are ‘further’ apart. An agglomerative hierarchical cluster tree was then constructed from these distances using the Unweighted Pair Group Method with Arithmetic mean (UPGMA). Specifically, starting with individual genes, hierarchical clusters were constructed by identifying the two clusters (or individual genes) that are closest to one another according to the arithmetic mean of the distances between all inter-cluster gene pairs. The pairs of clusters (or individual genes) with the smallest distance are then grouped together and the process is repeated. The matrix of pairwise correlations was then sorted based on the order of the genes within these trees.
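The clustering steps above can be sketched in plain Python as follows. This is an unoptimized illustration under stated assumptions, not the implementation that was used: the function names and the dictionary-based data layout are hypothetical, every gene is assumed to vary between cells (so that the Pearson coefficient is defined), and every cell is assumed to contain at least one counted RNA.

```python
from math import sqrt


def normalize_by_cell_total(counts):
    """Divide each gene's copy number by the total RNA counted in that cell."""
    n_cells = len(next(iter(counts.values())))
    totals = [sum(c[k] for c in counts.values()) for k in range(n_cells)]
    return {g: [c[k] / totals[k] for k in range(n_cells)]
            for g, c in counts.items()}


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def upgma(profiles):
    """Agglomerative clustering with average linkage (UPGMA).

    profiles maps gene name -> per-cell values. Returns (merges, order):
    merges lists (cluster_id, cluster_id, distance) in the order joined,
    and order lists the genes as they appear along the finished tree.
    """
    genes = list(profiles)
    dist = {(a, b): 1.0 - pearson(profiles[genes[a]], profiles[genes[b]])
            for a in range(len(genes)) for b in range(a + 1, len(genes))}
    clusters = {i: [i] for i in range(len(genes))}  # id -> member genes
    merges, next_id = [], len(genes)

    def cluster_distance(u, v):
        # Arithmetic mean over all inter-cluster gene pairs.
        pairs = [(min(i, j), max(i, j))
                 for i in clusters[u] for j in clusters[v]]
        return sum(dist[p] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        ids = sorted(clusters)
        d, u, v = min((cluster_distance(u, v), u, v)
                      for k, u in enumerate(ids) for v in ids[k + 1:])
        merges.append((u, v, d))
        clusters[next_id] = clusters.pop(u) + clusters.pop(v)
        next_id += 1
    (leaves,) = clusters.values()
    return merges, [genes[i] for i in leaves]
```

The returned gene order can be used to sort the rows and columns of the pairwise correlation matrix, and cutting the list of merges at a chosen distance yields groups of co-varying genes.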

Groups of genes with substantial co-variations were identified by selecting a threshold on the hierarchical cluster tree (indicated by the dashed lines in FIGS. 7D and 10A) that produced approximately 10 groups of genes, each of which contains at least 4 members, for the 140-gene experiments or approximately 100 groups, each of which contains at least 3 members, for the 1001-gene experiments. It is noted that one can change the threshold in order to identify either more tightly coupled smaller groups or larger groups with relatively loose coupling.

A probability value for the confidence that a gene belongs to a specific group was determined by computing the difference between the average correlation coefficient between that gene and all other members of that group and the average correlation coefficient between that gene and all other measured genes outside that group. The significance (p-value) of this difference was determined with Student's t-test.

Because hierarchical clustering is inherently a one-dimensional analysis, i.e. any given gene can only be a member of a single group, this analysis does not allow all correlated gene groups to be identified. Higher dimension analysis, such as principal component analysis or k-means clustering, could be used to identify more co-varying gene clusters.

Analysis of RNA spatial distributions. To identify genes that have similar spatial distributions, each of the measured cells was subdivided into 2×2 regions, and the fraction of each RNA species present in each of these bins was calculated. To control for the fact that some regions of the cell naturally contain more RNA than others, the enrichment was calculated for each gene, i.e., the ratio of the observed fraction in a given region for a given RNA species to the average fraction observed for all genes in that same region. For each pair of RNA species, the Pearson correlation coefficient of the region-to-region variation in enrichment of these two RNA species for each cell was determined and the correlation coefficients were averaged over ˜400 cells imaged in 7 independent data sets. RNA species were then clustered based on these average correlation coefficients using the same hierarchical clustering algorithm described above. Because of the large number of cells used for the analysis, it was found that the coarse spatial binning (2×2 regions per cell) was sufficient to capture the spatial correlation between genes and finer binning did not produce more significantly correlated groups.
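The binning and enrichment steps can be illustrated with the sketch below. How each cell was divided into its 2×2 regions is not specified above; splitting the cell's bounding box at its midpoints is an assumption made here, as are the function names and the convention of reporting an enrichment of 0.0 for a region in which no gene was observed.

```python
def region_fractions(points, bounds):
    """Fraction of one RNA species found in each region of a 2x2 grid.

    points: list of (x, y) positions of the molecules in one cell.
    bounds: (x_min, y_min, x_max, y_max) of the cell (assumed bounding box).
    """
    if not points:
        raise ValueError("at least one molecule is required")
    x_min, y_min, x_max, y_max = bounds
    x_mid, y_mid = (x_min + x_max) / 2, (y_min + y_max) / 2
    counts = [0, 0, 0, 0]
    for x, y in points:
        column = 1 if x >= x_mid else 0
        row = 1 if y >= y_mid else 0
        counts[column + 2 * row] += 1
    return [c / len(points) for c in counts]


def enrichment(fractions_by_gene):
    """Ratio of each gene's regional fraction to the all-gene mean fraction."""
    genes = list(fractions_by_gene)
    n_regions = len(fractions_by_gene[genes[0]])
    mean = [sum(fractions_by_gene[g][r] for g in genes) / len(genes)
            for r in range(n_regions)]
    return {g: [fractions_by_gene[g][r] / mean[r] if mean[r] > 0 else 0.0
                for r in range(n_regions)]
            for g in genes}
```

The per-cell enrichment vectors of two RNA species can then be compared with a Pearson correlation coefficient and the coefficients averaged over cells, as described above.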

To measure the distances of genes from the nuclei and from the cell edge, brightness thresholds on the cell images were first used to segment the nuclei and identify the cell edges. The distance from every RNA molecule to the nearest part of the nucleus and nearest part of the cell edge was then determined. For each data set, the average distance for each RNA species averaged over all the cells measured was then computed. These distances were averaged for the group I genes, group II genes or all genes. Only those RNA species with at least 10 counts per cell were used in this analysis to minimize statistical error on the distance values.

Gene ontology (GO) analysis. Groups of genes were selected from the hierarchical trees as discussed above. A collection of GO terms was determined for all measured RNA species as well as the RNA species associated with each group from the most recent human GO annotations using both the annotated GO terms and terms immediately upstream or downstream of the found annotations. The enrichment of these annotations was calculated from the ratio of the fraction of genes within each group that have this term to the fraction of all measured genes that have this term, and the p-value for this enrichment was calculated via the hypergeometric function. Only statistically significantly enriched GO terms with a p-value less than 0.05 were considered.
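The enrichment ratio and its p-value can be sketched as follows. The use of the upper tail of the hypergeometric distribution (the probability of seeing at least the observed number of annotated genes in the group) is an assumption of this example, and the function name and arguments are hypothetical.

```python
from math import comb


def go_enrichment(group_with_term, group_size, all_with_term, all_size):
    """Enrichment ratio and hypergeometric p-value for one GO term.

    group_with_term: genes in the group annotated with the term
    group_size:      genes in the group
    all_with_term:   measured genes annotated with the term (must be > 0)
    all_size:        all measured genes
    """
    ratio = (group_with_term / group_size) / (all_with_term / all_size)
    # Assumed one-sided test: P(X >= group_with_term) when group_size genes
    # are drawn without replacement from all_size measured genes.
    upper = min(group_size, all_with_term)
    favorable = sum(
        comb(all_with_term, k) * comb(all_size - all_with_term, group_size - k)
        for k in range(group_with_term, upper + 1)
    )
    return ratio, favorable / comb(all_size, group_size)


if __name__ == "__main__":
    ratio, p_value = go_enrichment(3, 4, 10, 140)
    print(ratio, p_value, p_value < 0.05)
```

The numbers in the usage lines are made up for illustration and do not correspond to any measured group.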

While several embodiments of the present invention have been described and illustrated herein, those of ordinary skill in the art will readily envision a variety of other means and/or structures for performing the functions and/or obtaining the results and/or one or more of the advantages described herein, and each of such variations and/or modifications is deemed to be within the scope of the present invention. More generally, those skilled in the art will readily appreciate that all parameters, dimensions, materials, and configurations described herein are meant to be exemplary and that the actual parameters, dimensions, materials, and/or configurations will depend upon the specific application or applications for which the teachings of the present invention is/are used. Those skilled in the art will recognize, or be able to ascertain using no more than routine experimentation, many equivalents to the specific embodiments of the invention described herein. It is, therefore, to be understood that the foregoing embodiments are presented by way of example only and that, within the scope of the appended claims and equivalents thereto, the invention may be practiced otherwise than as specifically described and claimed. The present invention is directed to each individual feature, system, article, material, kit, and/or method described herein. In addition, any combination of two or more such features, systems, articles, materials, kits, and/or methods, if such features, systems, articles, materials, kits, and/or methods are not mutually inconsistent, is included within the scope of the present invention.

All definitions, as defined and used herein, should be understood to control over dictionary definitions, definitions in documents incorporated by reference, and/or ordinary meanings of the defined terms.

The indefinite articles “a” and “an,” as used herein in the specification and in the claims, unless clearly indicated to the contrary, should be understood to mean “at least one.”

The phrase “and/or,” as used herein in the specification and in the claims, should be understood to mean “either or both” of the elements so conjoined, i.e., elements that are conjunctively present in some cases and disjunctively present in other cases. Multiple elements listed with “and/or” should be construed in the same fashion, i.e., “one or more” of the elements so conjoined. Other elements may optionally be present other than the elements specifically identified by the “and/or” clause, whether related or unrelated to those elements specifically identified. Thus, as a non-limiting example, a reference to “A and/or B”, when used in conjunction with open-ended language such as “comprising” can refer, in one embodiment, to A only (optionally including elements other than B); in another embodiment, to B only (optionally including elements other than A); in yet another embodiment, to both A and B (optionally including other elements); etc.

As used herein in the specification and in the claims, “or” should be understood to have the same meaning as “and/or” as defined above. For example, when separating items in a list, “or” or “and/or” shall be interpreted as being inclusive, i.e., the inclusion of at least one, but also including more than one, of a number or list of elements, and, optionally, additional unlisted items. Only terms clearly indicated to the contrary, such as “only one of” or “exactly one of,” or, when used in the claims, “consisting of,” will refer to the inclusion of exactly one element of a number or list of elements. In general, the term “or” as used herein shall only be interpreted as indicating exclusive alternatives (i.e. “one or the other but not both”) when preceded by terms of exclusivity, such as “either,” “one of,” “only one of,” or “exactly one of.” “Consisting essentially of,” when used in the claims, shall have its ordinary meaning as used in the field of patent law.

As used herein in the specification and in the claims, the phrase “at least one,” in reference to a list of one or more elements, should be understood to mean at least one element selected from any one or more of the elements in the list of elements, but not necessarily including at least one of each and every element specifically listed within the list of elements and not excluding any combinations of elements in the list of elements. This definition also allows that elements may optionally be present other than the elements specifically identified within the list of elements to which the phrase “at least one” refers, whether related or unrelated to those elements specifically identified. Thus, as a non-limiting example, “at least one of A and B” (or, equivalently, “at least one of A or B,” or, equivalently “at least one of A and/or B”) can refer, in one embodiment, to at least one, optionally including more than one, A, with no B present (and optionally including elements other than B); in another embodiment, to at least one, optionally including more than one, B, with no A present (and optionally including elements other than A); in yet another embodiment, to at least one, optionally including more than one, A, and at least one, optionally including more than one, B (and optionally including other elements); etc.

When the word “about” is used herein in reference to a number, it should be understood that still another embodiment of the invention includes that number not modified by the presence of the word “about.”

It should also be understood that, unless clearly indicated to the contrary, in any methods claimed herein that include more than one step or act, the order of the steps or acts of the method is not necessarily limited to the order in which the steps or acts of the method are recited.

In the claims, as well as in the specification above, all transitional phrases such as “comprising,” “including,” “carrying,” “having,” “containing,” “involving,” “holding,” “composed of,” and the like are to be understood to be open-ended, i.e., to mean including but not limited to. Only the transitional phrases “consisting of” and “consisting essentially of” shall be closed or semi-closed transitional phrases, respectively, as set forth in the United States Patent Office Manual of Patent Examining Procedures, Section 2111.03.

1-176. (canceled)
177. A method for determining a plurality of different nucleic acid targets within a sample by in situ hybridization, comprising: (a) contacting a sample comprising a plurality of different nucleic acid targets with a plurality of primary nucleic acid probes, wherein each primary nucleic acid probe comprises (i) a target sequence and (ii) one or more read sequences, wherein the target sequence hybridizes to the nucleic acid target in the sample to form a primary probe-hybridized complex; (b) contacting the sample to a second round of hybridization using a plurality of secondary nucleic acid probes wherein the secondary nucleic acid probes comprise a first portion that hybridizes to a subset of the read sequences of the primary nucleic acid probe and a second portion comprising a fluorescent label to form a secondary probe-hybridized complex; (c) detecting a fluorescent signal from each of the secondary probe-hybridized complexes; (d) removing the secondary nucleic acid probes or inactivating the fluorescent signal from each of the secondary probe-hybridized complexes; and (e) repeating steps (b) through (d) with multiple sets of differing secondary nucleic acid probes to generate a pattern of binding of the secondary nucleic acid probes; wherein the pattern of binding of the secondary nucleic acid probes is converted to a codeword, wherein the codeword determines the identity of the nucleic acid targets, and wherein the codeword provides an error correction system.
178. The method of claim 177, wherein the target sequence of the plurality of primary nucleic acid probes comprises an average length of between 10 and 200 nucleotides.
179. The method of claim 177, wherein the nucleic acid target is RNA.
180. The method of claim 179, wherein the method comprises determining the transcriptome of a cell.
181. The method of claim 180, wherein at least 25% of the transcriptome is determined.
182. The method of claim 177, wherein the plurality of primary nucleic acid probes comprises at least 10 different primary nucleic acid probes.
183. The method of claim 177, wherein the plurality of primary nucleic acid probes comprises at least 8 read sequences.
184. The method of claim 177, wherein each set of the plurality of secondary nucleic acid probes hybridizes to a subset of the read sequences.
185. The method of claim 177, wherein the one or more read sequences have an average length of at least 15 nucleotides.
186. The method of claim 177, wherein the fluorescent signal is inactivated by photobleaching.
187. The method of claim 177, wherein the fluorescent signal is inactivated by chemically or enzymatically cleaving the fluorescent label from the readout probe-hybridized complexes.
188. The method of claim 177, wherein the fluorescent signal is detected using a fluorescence imaging technique.
189. The method of claim 177, wherein the one or more read sequences are distributed on the plurality of primary nucleic acid probes so as to define an error-correcting code.
190. The method of claim 177, wherein if the pattern of binding of the secondary nucleic acid probes identifies a codeword that does not match a valid codeword, error correction is applied to the codeword to form a valid codeword.
191. The method of claim 177, wherein the error correction system comprises a Hamming system, a Golay code, or an extended Hamming system.
192. The method of claim 177, wherein the error correction system identifies the location of an error.
193. The method of claim 177, wherein the error correction system identifies a single bit error.
194. The method of claim 177, wherein the error correction system identifies a two bit error.
195. The method of claim 177, wherein the error correction system corrects a single bit error.
196. The method of claim 191, wherein the plurality of primary nucleic acid probes defines a code space with a Hamming distance of at least 2.
197. The method of claim 177, wherein each codeword comprises at least a 14 bit code.
198. The method of claim 177, wherein the binding pattern of secondary nucleic acid probes is used to determine the location of the primary nucleic acid probes within the sample.
199. The method of claim 177, wherein a given set of differing secondary nucleic acid probes has differing fluorescent labels.
200. The method of claim 177, wherein multiple sets of differing secondary nucleic acid probes have the same fluorescent label.
201. The method of claim 177, wherein the presence of the fluorescent signal is assigned a value in the codeword.
202. The method of claim 177, wherein the absence of the fluorescent signal is assigned a value in the codeword.
203. The method of claim 177, wherein the determining of the plurality of different nucleic acid targets within the sample is qualitative and/or quantitative.
204. The method of claim 177, wherein the determination of the plurality of different nucleic acid targets within the sample is spatial.
205. The method of claim 177, wherein the position of the primary nucleic acid probe within the sample is determined in at least 2 dimensions.
206. The method of claim 177, wherein the position of the primary nucleic acid probe within the sample is determined in at least 3 dimensions.